Duplicate detection - step 3: remove true duplicates¶
This notebook runs the third part of the duplicate detection algorithm on a dataframe with the following columns:
- archiveType (used for duplicate detection algorithm)
- dataSetName
- datasetId
- geo_meanElev (used for duplicate detection algorithm)
- geo_meanLat (used for duplicate detection algorithm)
- geo_meanLon (used for duplicate detection algorithm)
- geo_siteName (used for duplicate detection algorithm)
- interpretation_direction
- interpretation_seasonality
- interpretation_variable
- interpretation_variableDetails
- originalDataURL
- originalDatabase
- paleoData_notes
- paleoData_proxy (used for duplicate detection algorithm)
- paleoData_sensorSpecies
- paleoData_units
- paleoData_values (used for duplicate detection algorithm; tested for correlation, RMSE, correlation of 1st difference, RMSE of 1st difference)
- paleoData_variableName
- year (used for duplicate detection algorithm)
- yearUnits

This interactive notebook (dup_removal.ipynb) removes the duplicates flagged in dup_detection.ipynb, following the decisions made in dup_decision.ipynb. The decisions include
- removal of redundant duplicates
- creation of composites
The operator decisions are specified in data/DATABASENAME/duplicate_detection/duplicate_decisions_DATABASENAME_AUTHORINITIALS_YY-MM-DD.csv.
Ultimately, a duplicate-free dataframe is saved under
- data/DATABASENAME/DATABASENAME_dupfree.pkl
- data/DATABASENAME/DATABASENAME_dupfree_data.csv
- data/DATABASENAME/DATABASENAME_dupfree_year.csv
- data/DATABASENAME/DATABASENAME_dupfree_metadata.csv
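The actual writing is handled by the project's utility functions; as a rough stand-alone sketch (toy data, plain pandas calls, and a temporary directory standing in for data/DATABASENAME/ — the split of columns between the csv files is an assumption for illustration), the four outputs could be produced like this:

```python
import tempfile
from pathlib import Path

import pandas as pd

# Toy stand-in for the duplicate-free compact dataframe (columns are illustrative)
df_dupfree = pd.DataFrame({
    'datasetId': ['rec_1', 'rec_2'],
    'geo_meanLat': [10.0, -5.0],
    'paleoData_values': [[0.1, 0.2], [1.0, 1.1]],
    'year': [[1900.0, 1901.0], [1950.0, 1951.0]],
})

db_name = 'DATABASENAME'           # placeholder for the actual database name
outdir = Path(tempfile.mkdtemp())  # stands in for data/DATABASENAME/

# Full dataframe as a lossless pickle
df_dupfree.to_pickle(outdir / f'{db_name}_dupfree.pkl')
# Values and years as separate per-record csv tables
pd.DataFrame(df_dupfree['paleoData_values'].tolist(),
             index=df_dupfree['datasetId']).to_csv(outdir / f'{db_name}_dupfree_data.csv')
pd.DataFrame(df_dupfree['year'].tolist(),
             index=df_dupfree['datasetId']).to_csv(outdir / f'{db_name}_dupfree_year.csv')
# Everything else is metadata
df_dupfree.drop(columns=['paleoData_values', 'year']).to_csv(
    outdir / f'{db_name}_dupfree_metadata.csv', index=False)

saved = sorted(p.name for p in outdir.iterdir())
print(saved)
```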
10/11/2025 by LL: tidied up with revised data organisation and prepared for documentation
02/12/2024 by LL: modified the compositing process for metadata to fix bugs and make it more user-friendly; added some extra information to the bottom of the file (prior to the figures)
22/10/2024 by LL: added the composite option for duplicates (create z-scores and average over the shared time period)
30/09/2024 by LL: keep all original database values for removed duplicates with more than one original database
Author: Lucie Luecke, created 27/9/2024
Initialisation¶
Set up working environment¶
Make sure the repo_root is set correctly; it should be your_root_dir/dod2k. This should be the working directory throughout this notebook (and all other notebooks).
%load_ext autoreload
%autoreload 2
import sys
import os
from pathlib import Path
# Add parent directory to path (works from any notebook in notebooks/)
# the repo_root should be the parent directory of the notebooks folder
current_dir = Path().resolve()
# Determine repo root
if current_dir.name == 'dod2k':
    repo_root = current_dir
elif current_dir.parent.name == 'dod2k':
    repo_root = current_dir.parent
else:
    raise Exception('Please review the repo root structure (see first cell).')
# Update cwd and path only if needed
if os.getcwd() != str(repo_root):
    os.chdir(repo_root)
if str(repo_root) not in sys.path:
    sys.path.insert(0, str(repo_root))
print(f"Repo root: {repo_root}")
if str(os.getcwd()) == str(repo_root):
    print("Working directory matches repo root.")
Repo root: /home/jupyter-lluecke/dod2k_v2.0/dod2k Working directory matches repo root.
import pandas as pd
import numpy as np
import datetime
from dod2k_utilities import ut_functions as utf # general utility functions
from dod2k_utilities import ut_duplicate_search as dup # duplicate detection utilities
Load dataset¶
Define the dataset which needs to be screened for duplicates. Input files for the duplicate detection mechanism need to be compact dataframes (pandas dataframes with standardised columns and entry formatting).
The function load_compact_dataframe_from_csv loads the dataframe from a csv file in data/DB/, where DB is the name of the database. The database name (db_name) can be one of
- pages2k
- ch2k
- iso2k
- sisal
- fe23
for the individual databases, or
all_merged
to load the merged database of all individual databases, or the name of any user-defined compact dataframe.
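For orientation, a compact dataframe is simply a pandas DataFrame with one record per row and list-valued paleoData_values and year columns. A made-up two-record example (illustrative IDs and values only, with just a subset of the standardised columns):

```python
import pandas as pd

# Made-up compact dataframe: one proxy record per row,
# time series stored as lists in paleoData_values and year
df_example = pd.DataFrame({
    'datasetId': ['pages2k_0', 'iso2k_296'],
    'archiveType': ['coral', 'coral'],
    'geo_meanLat': [-17.5, -17.5],
    'geo_meanLon': [179.2, 179.2],
    'paleoData_proxy': ['d18O', 'd18O'],
    'paleoData_values': [[-4.5, -4.3], [-4.5, -4.3]],
    'year': [[1900, 1901], [1900, 1901]],
})
print(df_example.shape)  # (2, 7)
```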
# load dataframe
db_name='all_merged'
# db_name='dup_test'
df = utf.load_compact_dataframe_from_csv(db_name)
print(df.info())
df.name = db_name
<class 'pandas.core.frame.DataFrame'> RangeIndex: 5320 entries, 0 to 5319 Data columns (total 21 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 5320 non-null object 1 dataSetName 5320 non-null object 2 datasetId 5320 non-null object 3 geo_meanElev 5221 non-null float32 4 geo_meanLat 5320 non-null float32 5 geo_meanLon 5320 non-null float32 6 geo_siteName 5320 non-null object 7 interpretation_direction 5320 non-null object 8 interpretation_seasonality 5320 non-null object 9 interpretation_variable 5320 non-null object 10 interpretation_variableDetail 5320 non-null object 11 originalDataURL 5320 non-null object 12 originalDatabase 5320 non-null object 13 paleoData_notes 5320 non-null object 14 paleoData_proxy 5320 non-null object 15 paleoData_sensorSpecies 5320 non-null object 16 paleoData_units 5320 non-null object 17 paleoData_values 5320 non-null object 18 paleoData_variableName 5320 non-null object 19 year 5320 non-null object 20 yearUnits 5320 non-null object dtypes: float32(3), object(18) memory usage: 810.6+ KB None
Set datasetId as dataframe index to reliably identify the duplicates:
df.set_index('datasetId', inplace = True)
df['datasetId']=df.index
Input operator's credentials¶
To ensure maximum transparency and reproducibility, enter the operator's credentials here.
These details are used to label the intermediate output files and are provided along with the final duplicate-free dataset.
initials = 'LL'
fullname = 'Lucie Luecke'
email = 'ljluec1@st-andrews.ac.uk'
# initials = 'MNE'
# fullname = 'Michael Evans'
# email = 'mnevans@umd.edu'
operator_details = [initials, fullname, email]
Apply duplicate decisions to dataframe¶
Load duplicate decisions from csv¶
Please specify the date of the decision process below. The decision output file is then loaded from data/DBNAME/dup_detection/dup_decisions_DBNAME_INITIALS_DATE.csv.
# date = str(datetime.datetime.utcnow())[2:10]
# date='25-12-11' # for Lucie's dup_test decisions
# date='25-12-07' # for Mike's latest all_merged decisions
date='25-12-11' # for Lucie's latest all_merged decisions
filename = f'data/{df.name}/dup_detection/dup_decisions_{df.name}_{initials}_{date}'
data, header = dup.read_csv(filename, header=True)
df_decisions = pd.read_csv(filename+'.csv', header=5)
for hh in header:
    print(hh)
print(df_decisions.columns)
print(df.name)
Decisions for duplicate candidate pairs.
Operated by Lucie Luecke (LL)
E-Mail: ljluec1@st-andrews.ac.uk
Created on: 2025-12-11 12:57:05.712619 (UTC)
test
Index(['index 1', 'index 2', 'figure path', 'datasetId 1', 'datasetId 2',
'originalDatabase 1', 'originalDatabase 2', 'geo_siteName 1',
'geo_siteName 2', 'geo_meanLat 1', 'geo_meanLat 2', 'geo_meanLon 1',
'geo_meanLon 2', 'geo_meanElevation 1', 'geo_meanElevation 2',
'archiveType 1', 'archiveType 2', 'paleoData_proxy 1',
'paleoData_proxy 2', 'originalDataURL 1', 'originalDataURL 2', 'year 1',
'year 2', 'Decision 1', 'Decision 2', 'Decision type',
'Decision comment'],
dtype='object')
all_merged
Save a list of all candidate IDs individually (not as pairs) and collect the associated decisions.
# all candidate IDs
candidate_IDs = list(df_decisions['datasetId 1'])
candidate_IDs += list(df_decisions['datasetId 2'])
candidate_IDs = np.unique(candidate_IDs)
# decisions
decisions = {}
for ind in df_decisions.index:
    id1, id2 = df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
    dec1, dec2 = df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
    # collect the decision from each candidate pair for both records involved
    for rec_id, dec in zip([id1, id2], [dec1, dec2]):
        if rec_id not in decisions:
            decisions[rec_id] = []
        decisions[rec_id] += [dec]
Show the decisions associated with each individual record
for rr, dd in decisions.items():
    print(rr, dd)
pages2k_0 ['REMOVE', 'REMOVE', 'REMOVE'] iso2k_296 ['KEEP', 'KEEP', 'KEEP'] iso2k_298 ['KEEP', 'REMOVE', 'KEEP'] iso2k_299 ['KEEP', 'REMOVE', 'REMOVE'] pages2k_6 ['REMOVE'] FE23_northamerica_usa_az555 ['KEEP'] pages2k_50 ['REMOVE'] FE23_northamerica_canada_cana091 ['KEEP'] pages2k_62 ['KEEP'] pages2k_63 ['KEEP'] pages2k_81 ['REMOVE', 'REMOVE'] ch2k_HE08LRA01_76 ['KEEP', 'REMOVE'] iso2k_1813 ['KEEP', 'KEEP'] pages2k_83 ['REMOVE'] iso2k_1916 ['KEEP'] pages2k_85 ['KEEP'] pages2k_88 ['REMOVE'] pages2k_94 ['REMOVE'] FE23_northamerica_canada_cana153 ['KEEP'] pages2k_107 ['REMOVE'] FE23_northamerica_usa_ak046 ['KEEP'] pages2k_121 ['KEEP'] pages2k_122 ['REMOVE'] pages2k_132 ['REMOVE'] FE23_northamerica_canada_cana225 ['KEEP'] pages2k_158 ['REMOVE'] FE23_northamerica_usa_wa069 ['KEEP'] pages2k_171 ['REMOVE'] FE23_northamerica_usa_wy021 ['KEEP'] pages2k_203 ['REMOVE'] iso2k_826 ['KEEP'] pages2k_225 ['REMOVE'] FE23_northamerica_usa_nv512 ['KEEP', 'REMOVE'] pages2k_238 ['REMOVE'] iso2k_1044 ['KEEP'] pages2k_242 ['REMOVE', 'REMOVE'] ch2k_LI06FIJ01_582 ['KEEP', 'REMOVE'] iso2k_353 ['KEEP', 'KEEP'] pages2k_258 ['REMOVE'] iso2k_1498 ['KEEP'] pages2k_263 ['REMOVE'] iso2k_1322 ['KEEP'] pages2k_267 ['REMOVE', 'REMOVE'] iso2k_58 ['KEEP', 'REMOVE'] iso2k_1068 ['KEEP', 'KEEP'] pages2k_271 ['REMOVE', 'REMOVE'] ch2k_FE18RUS01_492 ['KEEP', 'REMOVE'] iso2k_1861 ['KEEP', 'KEEP'] pages2k_273 ['REMOVE'] FE23_asia_russ130w ['KEEP'] pages2k_281 ['REMOVE'] FE23_northamerica_canada_cana155 ['KEEP'] pages2k_294 ['REMOVE'] FE23_northamerica_usa_ak021 ['KEEP'] pages2k_305 ['KEEP'] pages2k_309 ['REMOVE'] pages2k_307 ['REMOVE'] pages2k_311 ['KEEP'] pages2k_315 ['REMOVE'] iso2k_362 ['KEEP'] pages2k_317 ['REMOVE', 'REMOVE'] ch2k_NA09MAL01_84 ['KEEP', 'REMOVE'] iso2k_1754 ['KEEP', 'KEEP'] pages2k_323 ['REMOVE'] FE23_northamerica_canada_cana210 ['KEEP'] pages2k_385 ['REMOVE', 'REMOVE'] ch2k_FE09OGA01_304 ['KEEP', 'REMOVE'] iso2k_1922 ['KEEP', 'KEEP'] pages2k_387 ['REMOVE'] ch2k_FE09OGA01_306 ['KEEP'] 
pages2k_395 ['REMOVE', 'REMOVE'] ch2k_CA07FLI01_400 ['KEEP', 'REMOVE'] iso2k_1057 ['KEEP', 'KEEP'] pages2k_397 ['REMOVE'] ch2k_CA07FLI01_402 ['KEEP'] pages2k_409 ['REMOVE', 'REMOVE'] ch2k_QU96ESV01_422 ['KEEP', 'REMOVE'] iso2k_218 ['KEEP', 'KEEP'] pages2k_414 ['KEEP'] pages2k_418 ['REMOVE'] pages2k_417 ['KEEP'] pages2k_421 ['REMOVE'] pages2k_427 ['KEEP'] pages2k_433 ['REMOVE'] pages2k_435 ['KEEP'] pages2k_842 ['REMOVE'] pages2k_444 ['KEEP', 'KEEP'] pages2k_445 ['REMOVE', 'KEEP'] pages2k_446 ['REMOVE', 'REMOVE'] pages2k_462 ['REMOVE', 'REMOVE'] ch2k_OS14UCP01_236 ['KEEP', 'REMOVE'] iso2k_350 ['KEEP', 'KEEP'] pages2k_468 ['KEEP', 'REMOVE'] pages2k_3550 ['REMOVE', 'REMOVE'] FE23_asia_russ137w ['KEEP', 'KEEP'] pages2k_472 ['KEEP', 'KEEP'] pages2k_474 ['REMOVE', 'KEEP'] pages2k_477 ['KEEP', 'REMOVE'] pages2k_478 ['REMOVE'] iso2k_1846 ['KEEP'] pages2k_486 ['REMOVE'] FE23_northamerica_usa_ca609 ['KEEP'] pages2k_495 ['REMOVE', 'REMOVE'] ch2k_LI06RAR01_12 ['KEEP', 'REMOVE'] iso2k_1502 ['KEEP', 'KEEP'] pages2k_500 ['REMOVE', 'REMOVE'] ch2k_AS05GUA01_302 ['KEEP', 'REMOVE'] iso2k_1559 ['KEEP', 'KEEP'] pages2k_541 ['KEEP'] iso2k_404 ['REMOVE'] pages2k_543 ['KEEP'] pages2k_976 ['REMOVE'] pages2k_565 ['REMOVE'] iso2k_998 ['KEEP'] pages2k_583 ['REMOVE'] FE23_northamerica_usa_mt116 ['KEEP'] pages2k_592 ['REMOVE', 'REMOVE'] ch2k_LI06RAR02_270 ['KEEP', 'REMOVE'] iso2k_1500 ['KEEP', 'KEEP'] pages2k_610 ['REMOVE'] iso2k_1199 ['KEEP'] pages2k_626 ['REMOVE'] FE23_northamerica_usa_wa071 ['KEEP'] pages2k_691 ['REMOVE'] FE23_northamerica_canada_cana062 ['KEEP'] pages2k_730 ['REMOVE'] iso2k_396 ['KEEP'] pages2k_736 ['REMOVE'] FE23_northamerica_usa_wy024 ['KEEP'] pages2k_800 ['REMOVE'] FE23_northamerica_canada_cana234 ['KEEP'] pages2k_818 ['REMOVE'] iso2k_488 ['KEEP'] pages2k_827 ['KEEP'] pages2k_830 ['REMOVE'] pages2k_831 ['KEEP', 'REMOVE'] pages2k_2220 ['REMOVE', 'REMOVE'] FE23_asia_russ127w ['KEEP', 'KEEP'] pages2k_857 ['REMOVE'] FE23_northamerica_usa_ut511 ['KEEP'] pages2k_881 ['REMOVE'] 
iso2k_1010 ['KEEP', 'KEEP'] pages2k_893 ['KEEP', 'KEEP'] pages2k_895 ['REMOVE', 'KEEP'] pages2k_900 ['REMOVE', 'REMOVE'] pages2k_940 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_DR99ABR01_264 ['KEEP', 'KEEP', 'REMOVE'] ch2k_DR99ABR01_266 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_91 ['KEEP', 'KEEP', 'KEEP'] pages2k_945 ['REMOVE'] iso2k_100 ['KEEP'] pages2k_960 ['REMOVE'] iso2k_641 ['KEEP'] pages2k_982 ['REMOVE'] FE23_northamerica_usa_or042 ['KEEP'] pages2k_1004 ['REMOVE'] iso2k_644 ['KEEP'] pages2k_1026 ['REMOVE'] FE23_northamerica_usa_az553 ['KEEP'] pages2k_1048 ['REMOVE'] iso2k_1212 ['KEEP'] pages2k_1089 ['REMOVE', 'REMOVE'] FE23_northamerica_usa_mt112 ['KEEP', 'COMPOSITE'] FE23_northamerica_usa_mt113 ['KEEP', 'COMPOSITE'] pages2k_1108 ['REMOVE'] iso2k_1060 ['KEEP'] pages2k_1116 ['REMOVE'] FE23_northamerica_canada_cana170w ['KEEP'] pages2k_1147 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_DA06MAF01_78 ['KEEP', 'REMOVE'] ch2k_DA06MAF02_104 ['KEEP', 'REMOVE'] iso2k_1748 ['KEEP', 'KEEP', 'KEEP'] pages2k_1153 ['KEEP', 'KEEP'] pages2k_1156 ['REMOVE', 'KEEP'] pages2k_1160 ['REMOVE', 'REMOVE'] pages2k_1209 ['REMOVE'] FE23_northamerica_usa_co553 ['KEEP'] pages2k_1252 ['REMOVE'] FE23_northamerica_canada_cana096 ['KEEP'] pages2k_1274 ['REMOVE'] iso2k_1577 ['KEEP'] pages2k_1293 ['REMOVE'] iso2k_821 ['KEEP'] pages2k_1325 ['REMOVE'] FE23_northamerica_usa_wy030 ['KEEP'] pages2k_1360 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_UR00MAI01_22 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_94 ['KEEP', 'KEEP', 'KEEP'] iso2k_98 ['KEEP', 'KEEP', 'REMOVE'] pages2k_1362 ['KEEP'] pages2k_1365 ['REMOVE'] pages2k_1370 ['REMOVE'] iso2k_1619 ['KEEP'] pages2k_1420 ['REMOVE'] FE23_northamerica_canada_cana111 ['KEEP'] pages2k_1442 ['KEEP'] pages2k_1444 ['REMOVE'] pages2k_1488 ['REMOVE', 'REMOVE', 'KEEP', 'REMOVE'] pages2k_1628 ['KEEP', 'KEEP', 'KEEP', 'KEEP'] ch2k_NU11PAL01_52 ['KEEP', 'REMOVE', 'KEEP', 'REMOVE'] iso2k_505 ['REMOVE', 'REMOVE', 'KEEP', 'REMOVE'] iso2k_579 ['KEEP', 'REMOVE', 'KEEP', 'REMOVE'] pages2k_1490 ['REMOVE'] 
ch2k_NU11PAL01_54 ['KEEP'] pages2k_1491 ['REMOVE'] iso2k_575 ['KEEP'] pages2k_1497 ['REMOVE'] iso2k_1885 ['KEEP'] pages2k_1515 ['KEEP'] pages2k_1519 ['REMOVE'] pages2k_1520 ['REMOVE'] pages2k_1522 ['KEEP'] pages2k_1547 ['REMOVE'] iso2k_259 ['KEEP'] pages2k_1566 ['REMOVE'] FE23_northamerica_canada_cana231 ['KEEP'] pages2k_1605 ['REMOVE'] FE23_northamerica_usa_ca606 ['KEEP'] pages2k_1619 ['REMOVE'] pages2k_1623 ['KEEP'] pages2k_1636 ['REMOVE'] FE23_northamerica_usa_wa081 ['KEEP'] pages2k_1686 ['KEEP'] pages2k_1688 ['KEEP'] pages2k_1692 ['REMOVE'] FE23_asia_mong012 ['KEEP'] pages2k_1703 ['REMOVE', 'REMOVE'] ch2k_MO06PED01_226 ['KEEP', 'REMOVE'] iso2k_629 ['KEEP', 'KEEP'] pages2k_1712 ['REMOVE'] iso2k_715 ['KEEP'] pages2k_1720 ['REMOVE'] iso2k_1579 ['KEEP'] pages2k_1741 ['REMOVE'] FE23_northamerica_usa_wa104 ['KEEP'] pages2k_1750 ['REMOVE', 'REMOVE'] iso2k_1856 ['KEEP', 'KEEP'] sisal_294.0_194 ['KEEP', 'REMOVE'] pages2k_1771 ['REMOVE'] ch2k_TU01LAI01_192 ['KEEP'] pages2k_1804 ['REMOVE'] FE23_northamerica_usa_me010 ['KEEP'] pages2k_1859 ['REMOVE', 'REMOVE'] ch2k_HE10GUA01_244 ['KEEP', 'REMOVE'] iso2k_1735 ['KEEP', 'KEEP'] pages2k_1861 ['REMOVE'] ch2k_HE10GUA01_246 ['KEEP'] pages2k_1880 ['REMOVE'] FE23_northamerica_usa_ak060 ['KEEP'] pages2k_1891 ['REMOVE'] pages2k_1893 ['KEEP'] pages2k_1918 ['REMOVE'] iso2k_102 ['KEEP'] pages2k_1920 ['KEEP'] pages2k_1923 ['REMOVE'] pages2k_1932 ['REMOVE'] pages2k_1934 ['KEEP'] pages2k_1942 ['REMOVE', 'REMOVE'] ch2k_ZI04IFR01_26 ['KEEP', 'REMOVE'] iso2k_257 ['KEEP', 'KEEP'] pages2k_1972 ['KEEP'] pages2k_1973 ['REMOVE'] pages2k_1976 ['REMOVE'] pages2k_1980 ['KEEP'] pages2k_1978 ['REMOVE'] pages2k_1983 ['KEEP'] pages2k_1985 ['REMOVE'] iso2k_1294 ['KEEP'] pages2k_1989 ['KEEP'] pages2k_1991 ['REMOVE'] pages2k_1994 ['REMOVE'] ch2k_DE12ANC01_258 ['KEEP'] pages2k_2013 ['REMOVE'] FE23_northamerica_canada_cana097 ['KEEP'] pages2k_2042 ['REMOVE', 'REMOVE'] ch2k_TU95MAD01_24 ['KEEP', 'REMOVE'] iso2k_20 ['KEEP', 'KEEP'] pages2k_2059 ['REMOVE'] 
FE23_northamerica_usa_ak058 ['KEEP'] pages2k_2085 ['REMOVE'] FE23_northamerica_canada_cana002 ['KEEP'] pages2k_2094 ['REMOVE', 'REMOVE'] ch2k_TU01DEP01_450 ['KEEP', 'REMOVE'] iso2k_1201 ['KEEP', 'KEEP'] pages2k_2098 ['KEEP'] pages2k_2103 ['KEEP'] pages2k_2110 ['REMOVE'] FE23_northamerica_usa_co554 ['KEEP'] pages2k_2146 ['KEEP', 'KEEP'] pages2k_2149 ['KEEP', 'KEEP'] pages2k_2150 ['REMOVE', 'REMOVE'] pages2k_2156 ['REMOVE'] FE23_northamerica_canada_cana169w ['KEEP'] pages2k_2214 ['REMOVE'] iso2k_1631 ['KEEP'] pages2k_2226 ['KEEP'] FE23_asia_mong007w ['REMOVE'] pages2k_2265 ['REMOVE'] FE23_northamerica_usa_ak070 ['KEEP'] pages2k_2287 ['KEEP'] pages2k_2290 ['REMOVE'] pages2k_2300 ['REMOVE'] ch2k_OS14RIP01_174 ['KEEP'] pages2k_2303 ['KEEP'] FE23_asia_mong006 ['REMOVE'] pages2k_2309 ['REMOVE'] ch2k_WE09ARR01_208 ['KEEP'] pages2k_2311 ['REMOVE'] ch2k_WE09ARR01_210 ['KEEP'] pages2k_2319 ['REMOVE'] FE23_northamerica_usa_ak6 ['KEEP'] pages2k_2339 ['REMOVE'] pages2k_2344 ['KEEP'] pages2k_2361 ['REMOVE'] FE23_northamerica_usa_wa097 ['KEEP'] pages2k_2402 ['REMOVE'] FE23_northamerica_usa_co586 ['KEEP'] pages2k_2430 ['REMOVE'] FE23_northamerica_canada_cana113 ['KEEP'] pages2k_2473 ['REMOVE'] FE23_northamerica_usa_wy022 ['KEEP'] pages2k_2500 ['REMOVE'] pages2k_2502 ['KEEP'] pages2k_2510 ['REMOVE'] iso2k_1626 ['KEEP'] pages2k_2514 ['REMOVE'] iso2k_1467 ['KEEP'] pages2k_2517 ['REMOVE'] iso2k_1130 ['KEEP'] pages2k_2534 ['REMOVE'] iso2k_1575 ['KEEP'] pages2k_2538 ['REMOVE'] iso2k_1862 ['KEEP'] pages2k_2561 ['REMOVE'] FE23_northamerica_canada_cana094 ['KEEP'] pages2k_2592 ['REMOVE'] pages2k_2596 ['KEEP'] pages2k_2595 ['REMOVE'] pages2k_2599 ['KEEP'] pages2k_2604 ['KEEP', 'REMOVE'] pages2k_2606 ['REMOVE', 'REMOVE'] iso2k_1481 ['KEEP', 'KEEP'] pages2k_2607 ['KEEP', 'KEEP'] pages2k_2609 ['REMOVE', 'KEEP'] pages2k_2612 ['REMOVE', 'REMOVE'] pages2k_2613 ['REMOVE'] iso2k_1470 ['KEEP'] pages2k_2617 ['REMOVE'] iso2k_1573 ['KEEP'] pages2k_2634 ['REMOVE'] FE23_northamerica_usa_id013 ['KEEP'] 
pages2k_2660 ['REMOVE'] FE23_northamerica_usa_ak014 ['KEEP'] pages2k_2677 ['REMOVE'] FE23_northamerica_usa_wy023 ['KEEP'] pages2k_2703 ['REMOVE'] FE23_northamerica_usa_ak094 ['KEEP'] pages2k_2722 ['REMOVE'] FE23_northamerica_canada_cana238 ['KEEP'] pages2k_2750 ['REMOVE'] iso2k_1708 ['KEEP'] pages2k_2752 ['KEEP', 'KEEP'] pages2k_2755 ['REMOVE', 'KEEP'] pages2k_2759 ['REMOVE', 'REMOVE'] pages2k_2793 ['KEEP'] pages2k_2795 ['REMOVE', 'KEEP'] pages2k_2798 ['KEEP', 'REMOVE'] pages2k_2796 ['KEEP'] pages2k_2830 ['REMOVE'] FE23_northamerica_mexico_mexi020 ['KEEP'] pages2k_2843 ['REMOVE'] FE23_northamerica_usa_wa083 ['KEEP'] pages2k_2899 ['REMOVE'] pages2k_2901 ['KEEP'] pages2k_2904 ['KEEP'] pages2k_2906 ['REMOVE'] pages2k_2922 ['REMOVE'] FE23_northamerica_usa_ca603 ['KEEP'] pages2k_2953 ['REMOVE'] iso2k_573 ['KEEP'] pages2k_2959 ['REMOVE'] FE23_northamerica_mexico_mexi043 ['KEEP'] pages2k_2976 ['REMOVE'] FE23_northamerica_usa_id008 ['KEEP'] pages2k_3002 ['REMOVE'] FE23_northamerica_usa_or043 ['KEEP'] pages2k_3028 ['KEEP', 'KEEP'] pages2k_3030 ['REMOVE', 'KEEP'] pages2k_3033 ['REMOVE', 'REMOVE'] pages2k_3038 ['REMOVE'] FE23_northamerica_usa_mt108 ['KEEP'] pages2k_3064 ['REMOVE'] iso2k_698 ['KEEP'] pages2k_3068 ['REMOVE', 'REMOVE'] ch2k_ZI14IFR02_522 ['KEEP', 'KEEP'] ch2k_ZI14IFR02_524 ['KEEP', 'REMOVE'] pages2k_3085 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_KU00NIN01_150 ['KEEP', 'REMOVE', 'KEEP'] iso2k_1554 ['KEEP', 'KEEP', 'REMOVE'] iso2k_1556 ['KEEP', 'REMOVE', 'KEEP'] pages2k_3107 ['REMOVE'] FE23_northamerica_usa_co552 ['KEEP', 'KEEP'] pages2k_3108 ['REMOVE'] pages2k_3132 ['REMOVE', 'REMOVE'] ch2k_QU06RAB01_144 ['KEEP', 'REMOVE'] iso2k_1311 ['KEEP', 'KEEP'] pages2k_3134 ['REMOVE'] ch2k_QU06RAB01_146 ['KEEP'] pages2k_3170 ['REMOVE'] FE23_australia_newz062 ['KEEP', 'KEEP'] pages2k_3179 ['REMOVE'] FE23_northamerica_usa_ak057 ['KEEP'] pages2k_3188 ['KEEP'] pages2k_3191 ['REMOVE'] pages2k_3196 ['KEEP'] FE23_asia_mong011 ['REMOVE'] pages2k_3202 ['REMOVE'] iso2k_1727 ['KEEP'] 
pages2k_3234 ['KEEP', 'KEEP'] pages2k_3236 ['REMOVE', 'KEEP'] pages2k_3239 ['REMOVE', 'REMOVE'] pages2k_3243 ['REMOVE'] iso2k_0 ['KEEP'] pages2k_3263 ['REMOVE'] iso2k_1264 ['KEEP'] pages2k_3266 ['REMOVE', 'REMOVE'] ch2k_GO12SBV01_396 ['KEEP', 'REMOVE'] iso2k_870 ['KEEP', 'KEEP'] pages2k_3307 ['REMOVE'] iso2k_339 ['KEEP'] pages2k_3313 ['REMOVE'] FE23_northamerica_usa_ca560 ['KEEP'] pages2k_3337 ['KEEP'] pages2k_3342 ['REMOVE'] pages2k_3352 ['REMOVE', 'REMOVE', 'REMOVE'] ch2k_ZI14TUR01_480 ['KEEP', 'KEEP', 'REMOVE'] ch2k_ZI14TUR01_482 ['KEEP', 'REMOVE', 'REMOVE'] iso2k_302 ['KEEP', 'KEEP', 'KEEP'] pages2k_3372 ['REMOVE', 'REMOVE'] ch2k_KI04MCV01_366 ['KEEP', 'REMOVE'] iso2k_155 ['KEEP', 'KEEP'] pages2k_3374 ['REMOVE'] ch2k_KI04MCV01_368 ['KEEP'] pages2k_3404 ['REMOVE'] FE23_northamerica_canada_cana029 ['KEEP'] pages2k_3417 ['KEEP'] pages2k_3419 ['REMOVE'] pages2k_3503 ['REMOVE'] FE23_northamerica_usa_wa072 ['KEEP'] pages2k_3524 ['REMOVE'] FE23_northamerica_usa_ak010 ['KEEP'] pages2k_3552 ['REMOVE'] iso2k_1581 ['KEEP'] pages2k_3554 ['REMOVE', 'REMOVE'] ch2k_LI94SEC01_436 ['KEEP', 'REMOVE'] iso2k_1124 ['KEEP', 'KEEP'] pages2k_3571 ['REMOVE'] iso2k_174 ['KEEP'] pages2k_3583 ['REMOVE'] FE23_northamerica_usa_co633 ['KEEP'] pages2k_3599 ['REMOVE', 'REMOVE'] iso2k_1069 ['KEEP', 'KEEP'] iso2k_1660 ['KEEP', 'KEEP'] pages2k_3609 ['REMOVE'] FE23_northamerica_canada_cana053 ['KEEP'] pages2k_3631 ['REMOVE'] iso2k_1530 ['KEEP'] pages2k_3642 ['REMOVE'] FE23_northamerica_usa_wy025 ['KEEP'] FE23_southamerica_arge016 ['KEEP'] FE23_southamerica_arge085 ['REMOVE'] FE23_northamerica_canada_cana100 ['REMOVE'] FE23_northamerica_canada_cana213 ['KEEP'] FE23_northamerica_canada_cana105 ['REMOVE'] FE23_northamerica_canada_cana217 ['KEEP'] FE23_northamerica_canada_cana116 ['KEEP'] FE23_northamerica_canada_cana168w ['REMOVE'] FE23_northamerica_canada_cana161 ['KEEP'] FE23_northamerica_canada_cana162 ['REMOVE'] FE23_southamerica_chil016 ['KEEP'] FE23_southamerica_chil017 ['REMOVE'] 
FE23_europe_swed019w ['KEEP'] FE23_europe_swed021w ['REMOVE'] FE23_northamerica_mexico_mexi022 ['REMOVE'] FE23_northamerica_mexico_mexi023 ['KEEP'] FE23_australia_newz003 ['REMOVE'] FE23_australia_newz060 ['KEEP'] FE23_australia_newz008 ['KEEP'] FE23_australia_newz092 ['REMOVE'] FE23_australia_newz014 ['REMOVE'] FE23_australia_newz061 ['KEEP'] FE23_australia_newz018 ['REMOVE'] FE23_australia_newz019 ['REMOVE'] FE23_australia_newz063 ['KEEP'] FE23_northamerica_usa_ca066 ['REMOVE'] FE23_northamerica_usa_ca628 ['KEEP', 'KEEP'] FE23_northamerica_usa_ca067 ['REMOVE'] FE23_northamerica_usa_ca512 ['REMOVE'] FE23_northamerica_usa_ca613 ['KEEP'] FE23_northamerica_usa_ca535 ['REMOVE'] FE23_northamerica_usa_ca670 ['KEEP'] FE23_northamerica_usa_me017 ['COMPOSITE'] FE23_northamerica_usa_me018 ['COMPOSITE'] FE23_northamerica_usa_mo ['KEEP'] FE23_northamerica_usa_mo009 ['REMOVE'] FE23_northamerica_usa_nj001 ['COMPOSITE'] FE23_northamerica_usa_nj002 ['COMPOSITE'] FE23_northamerica_usa_nm024 ['KEEP'] FE23_northamerica_usa_nm055 ['REMOVE'] FE23_northamerica_usa_nv060 ['REMOVE'] FE23_northamerica_usa_nv518 ['KEEP'] FE23_northamerica_usa_nv521 ['KEEP'] FE23_northamerica_usa_nv513 ['REMOVE'] FE23_northamerica_usa_nv520 ['KEEP'] ch2k_ZI15MER01_2 ['KEEP'] ch2k_ZI15MER01_4 ['REMOVE'] ch2k_CO03PAL03_6 ['REMOVE'] iso2k_511 ['KEEP'] ch2k_CO03PAL02_8 ['REMOVE'] iso2k_509 ['KEEP'] ch2k_CO03PAL07_14 ['REMOVE'] iso2k_521 ['KEEP'] ch2k_RE18CAY01_30 ['REMOVE'] iso2k_917 ['KEEP'] ch2k_KU99HOU01_40 ['REMOVE', 'KEEP'] iso2k_786 ['KEEP', 'KEEP'] iso2k_788 ['REMOVE', 'REMOVE'] ch2k_CA14TIM01_64 ['REMOVE'] iso2k_473 ['KEEP'] ch2k_SW98STP01_86 ['REMOVE'] iso2k_50 ['KEEP'] ch2k_CO03PAL01_110 ['REMOVE'] iso2k_507 ['KEEP'] ch2k_CH98PIR01_116 ['REMOVE'] iso2k_1229 ['KEEP'] ch2k_XI17HAI01_128 ['KEEP', 'REMOVE'] ch2k_XI17HAI01_136 ['REMOVE', 'REMOVE'] iso2k_1762 ['KEEP', 'KEEP'] ch2k_XI17HAI01_130 ['KEEP'] ch2k_XI17HAI01_134 ['REMOVE'] ch2k_DE14DTO03_140 ['KEEP'] ch2k_DE14DTO01_148 ['KEEP'] ch2k_EV18ROC01_184 
['KEEP'] ch2k_EV18ROC01_186 ['REMOVE'] ch2k_CA13SAP01_188 ['REMOVE'] iso2k_569 ['KEEP'] ch2k_HE13MIS01_194 ['KEEP', 'KEEP'] iso2k_211 ['KEEP'] iso2k_213 ['REMOVE'] ch2k_ZI15IMP02_200 ['KEEP'] ch2k_ZI15IMP02_202 ['REMOVE'] ch2k_PF04PBA01_204 ['REMOVE', 'REMOVE'] iso2k_1701 ['KEEP', 'KEEP'] iso2k_1704 ['KEEP', 'REMOVE'] ch2k_CO03PAL05_212 ['REMOVE'] iso2k_515 ['KEEP'] ch2k_ZI15TAN01_278 ['KEEP'] ch2k_ZI15TAN01_280 ['REMOVE'] ch2k_GU99NAU01_314 ['KEEP', 'REMOVE'] iso2k_702 ['REMOVE', 'REMOVE'] iso2k_705 ['KEEP', 'KEEP'] ch2k_CO03PAL10_324 ['REMOVE'] iso2k_519 ['KEEP'] ch2k_ZI15IMP01_328 ['KEEP'] ch2k_ZI15IMP01_330 ['REMOVE'] ch2k_RO19YUC01_338 ['KEEP'] ch2k_RO19YUC01_340 ['REMOVE'] ch2k_CO03PAL09_358 ['REMOVE'] iso2k_525 ['KEEP'] ch2k_BA04FIJ02_382 ['REMOVE'] iso2k_52 ['KEEP'] ch2k_CO03PAL06_386 ['REMOVE'] iso2k_517 ['KEEP'] ch2k_CO93TAR01_408 ['REMOVE'] iso2k_539 ['KEEP'] ch2k_CO00MAL01_412 ['REMOVE'] ch2k_DE13HAI01_424 ['KEEP', 'REMOVE'] ch2k_DE13HAI01_432 ['REMOVE', 'REMOVE'] iso2k_1643 ['KEEP', 'KEEP'] ch2k_DE13HAI01_426 ['KEEP'] ch2k_DE13HAI01_430 ['REMOVE'] ch2k_ZI15CLE01_438 ['KEEP'] ch2k_ZI15CLE01_440 ['REMOVE'] ch2k_CO03PAL04_452 ['REMOVE'] iso2k_513 ['KEEP'] ch2k_FL18DTO01_460 ['KEEP'] ch2k_FL18DTO02_554 ['KEEP'] ch2k_DU94URV01_468 ['KEEP'] ch2k_DU94URV01_470 ['REMOVE'] ch2k_CO03PAL08_472 ['REMOVE'] iso2k_523 ['KEEP'] ch2k_LI99CLI01_486 ['REMOVE'] iso2k_1571 ['KEEP'] ch2k_ZI15BUN01_488 ['KEEP'] ch2k_ZI15BUN01_490 ['REMOVE'] ch2k_WU13TON01_504 ['KEEP'] ch2k_WU13TON01_506 ['REMOVE'] ch2k_KI14PAR01_510 ['KEEP'] ch2k_KI14PAR01_518 ['REMOVE'] ch2k_KI14PAR01_512 ['KEEP'] ch2k_KI14PAR01_516 ['REMOVE'] ch2k_BA04FIJ01_558 ['REMOVE'] iso2k_55 ['KEEP'] iso2k_120 ['KEEP'] sisal_253.0_171 ['REMOVE'] iso2k_140 ['KEEP'] sisal_278.0_184 ['REMOVE'] iso2k_236 ['KEEP'] sisal_205.0_141 ['REMOVE'] iso2k_380 ['KEEP'] sisal_446.0_292 ['REMOVE'] iso2k_399 ['KEEP', 'KEEP'] iso2k_806 ['REMOVE', 'KEEP'] iso2k_811 ['REMOVE', 'REMOVE'] iso2k_533 ['KEEP'] sisal_115.0_69 ['KEEP'] 
iso2k_546 ['KEEP'] iso2k_549 ['REMOVE'] iso2k_547 ['KEEP'] iso2k_550 ['REMOVE'] iso2k_772 ['KEEP'] iso2k_775 ['REMOVE'] iso2k_873 ['KEEP'] sisal_471.0_314 ['REMOVE'] iso2k_1107 ['KEEP', 'KEEP'] iso2k_1817 ['REMOVE', 'KEEP'] sisal_271.0_174 ['REMOVE', 'REMOVE'] iso2k_1178 ['KEEP'] sisal_201.0_133 ['REMOVE'] iso2k_1283 ['KEEP'] iso2k_1286 ['REMOVE'] iso2k_1288 ['KEEP'] sisal_329.0_213 ['REMOVE'] iso2k_1291 ['KEEP'] sisal_330.0_215 ['REMOVE'] iso2k_1495 ['KEEP'] sisal_305.0_199 ['REMOVE'] iso2k_1504 ['KEEP'] sisal_113.0_66 ['REMOVE'] iso2k_1820 ['KEEP'] sisal_272.0_177 ['REMOVE'] iso2k_1823 ['KEEP'] sisal_273.0_179 ['REMOVE'] iso2k_1848 ['KEEP'] iso2k_1855 ['REMOVE'] iso2k_1850 ['KEEP'] iso2k_1851 ['REMOVE'] sisal_46.0_18 ['KEEP'] sisal_47.0_21 ['REMOVE'] sisal_46.0_19 ['KEEP'] sisal_47.0_22 ['REMOVE'] sisal_46.0_20 ['KEEP'] sisal_47.0_23 ['REMOVE'] sisal_430.0_270 ['KEEP'] sisal_896.0_531 ['REMOVE'] sisal_430.0_271 ['KEEP'] sisal_896.0_533 ['REMOVE']
# # for testing: to check the range of individual decisions, delete later
# ind_decisions = []
# for rr, dd in decisions.items():
#     if list(np.sort(dd)) not in ind_decisions:
#         ind_decisions.append(list(np.sort(dd)))
# for dd in ind_decisions:
#     print(dd)
# print(np.unique([len(dd) for dd in ind_decisions]))
Save all the duplicate details in one dictionary, which will be attached to the duplicate-free dataframe (final output) df_dupfree to document the duplicate detection process.
dup_details = dup.provide_dup_details(df_decisions, header)
Note that any one record can appear more than once and have multiple decisions associated with it (e.g. 'REMOVE', 'KEEP' or 'COMPOSITE').
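As a toy illustration (made-up record IDs and decision lists, not read from any decision file), the per-record decision lists can be summarised and records with non-unanimous decisions flagged like this:

```python
from collections import Counter

# Made-up example of the per-record decisions dict built above
decisions = {
    'pages2k_0':  ['REMOVE', 'REMOVE', 'REMOVE'],  # consistently flagged for removal
    'iso2k_298':  ['KEEP', 'REMOVE', 'KEEP'],      # mixed: kept in one pair, removed in another
    'FE23_nj001': ['COMPOSITE'],
}

# Records whose decisions are not unanimous warrant closer inspection
mixed = {rec: Counter(dec) for rec, dec in decisions.items() if len(set(dec)) > 1}
print(mixed)  # {'iso2k_298': Counter({'KEEP': 2, 'REMOVE': 1})}
```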
To remove the duplicates, we implement the following steps:
1. Records to be REMOVED: drop all records from the dataframe which are associated with the decision 'REMOVE' and save the result in df_dupfree_rmv.
2. Records to be COMPOSITED: create composites of these records and save them in df_composite.
3. Check for records which have both 'REMOVE' and 'COMPOSITE' decisions associated with them. These are potentially remaining duplicates; here, the operator is once again asked to make decisions in a 'mini' version of the duplicate workflow.
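The compositing itself is performed by the project utilities; conceptually (z-score each record, then average over the shared time period, as described in the changelog above), a minimal NumPy sketch with two synthetic annual records might look like:

```python
import numpy as np

def zscore(x):
    """Standardise a series to zero mean and unit standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Two made-up records on overlapping annual time axes
year1, vals1 = np.arange(1900, 1960), np.random.default_rng(0).normal(10, 2, 60)
year2, vals2 = np.arange(1930, 1990), np.random.default_rng(1).normal(5, 1, 60)

# Restrict both records to the shared time period ...
shared = np.intersect1d(year1, year2)
z1 = zscore(vals1)[np.isin(year1, shared)]
z2 = zscore(vals2)[np.isin(year2, shared)]

# ... and average the z-scores to form the composite
composite = np.nanmean([z1, z2], axis=0)
print(shared[0], shared[-1], composite.shape)  # 1930 1959 (30,)
```

Note the z-scoring here uses each record's full length before subsetting; whether standardisation happens before or after restricting to the overlap is a design choice made in the project's own compositing code.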
1. Records to be REMOVED¶
First, simply remove all records to which the decision 'REMOVE' and/or 'COMPOSITE' applies and store the result in df_dupfree_rmv; the IDs of the removed records are kept in df_duplica (for later inspection).
# load the records TO BE REMOVED
remove_IDs = list(df_decisions['datasetId 1'][np.isin(df_decisions['Decision 1'],['REMOVE', 'COMPOSITE'])])
remove_IDs += list(df_decisions['datasetId 2'][np.isin(df_decisions['Decision 2'],['REMOVE', 'COMPOSITE'])])
remove_IDs = np.unique(remove_IDs)
df_duplica = df.loc[remove_IDs, 'datasetId'] # IDs of the records which were removed
df_dupfree_rmv = df.drop(remove_IDs) # df freed from 'REMOVE'/'COMPOSITE' type duplicates
print(f'Removed {len(df_duplica)} REMOVE or COMPOSITE type records.')
print(f'REMOVE type duplicate free dataset contains {len(df_dupfree_rmv)} records.')
print('Removed the following IDs:', remove_IDs)
print(df.name)
# print(df.info())
Removed 356 REMOVE or COMPOSITE type records. REMOVE type duplicate free dataset contains 4964 records. Removed the following IDs: ['FE23_asia_mong006' 'FE23_asia_mong007w' 'FE23_asia_mong011' 'FE23_australia_newz003' 'FE23_australia_newz014' 'FE23_australia_newz018' 'FE23_australia_newz019' 'FE23_australia_newz092' 'FE23_europe_swed021w' 'FE23_northamerica_canada_cana100' 'FE23_northamerica_canada_cana105' 'FE23_northamerica_canada_cana162' 'FE23_northamerica_canada_cana168w' 'FE23_northamerica_mexico_mexi022' 'FE23_northamerica_usa_ca066' 'FE23_northamerica_usa_ca067' 'FE23_northamerica_usa_ca512' 'FE23_northamerica_usa_ca535' 'FE23_northamerica_usa_me017' 'FE23_northamerica_usa_me018' 'FE23_northamerica_usa_mo009' 'FE23_northamerica_usa_mt112' 'FE23_northamerica_usa_mt113' 'FE23_northamerica_usa_nj001' 'FE23_northamerica_usa_nj002' 'FE23_northamerica_usa_nm055' 'FE23_northamerica_usa_nv060' 'FE23_northamerica_usa_nv512' 'FE23_northamerica_usa_nv513' 'FE23_southamerica_arge085' 'FE23_southamerica_chil017' 'ch2k_AS05GUA01_302' 'ch2k_BA04FIJ01_558' 'ch2k_BA04FIJ02_382' 'ch2k_CA07FLI01_400' 'ch2k_CA13SAP01_188' 'ch2k_CA14TIM01_64' 'ch2k_CH98PIR01_116' 'ch2k_CO00MAL01_412' 'ch2k_CO03PAL01_110' 'ch2k_CO03PAL02_8' 'ch2k_CO03PAL03_6' 'ch2k_CO03PAL04_452' 'ch2k_CO03PAL05_212' 'ch2k_CO03PAL06_386' 'ch2k_CO03PAL07_14' 'ch2k_CO03PAL08_472' 'ch2k_CO03PAL09_358' 'ch2k_CO03PAL10_324' 'ch2k_CO93TAR01_408' 'ch2k_DA06MAF01_78' 'ch2k_DA06MAF02_104' 'ch2k_DE13HAI01_424' 'ch2k_DE13HAI01_430' 'ch2k_DE13HAI01_432' 'ch2k_DR99ABR01_264' 'ch2k_DR99ABR01_266' 'ch2k_DU94URV01_470' 'ch2k_EV18ROC01_186' 'ch2k_FE09OGA01_304' 'ch2k_FE18RUS01_492' 'ch2k_GO12SBV01_396' 'ch2k_GU99NAU01_314' 'ch2k_HE08LRA01_76' 'ch2k_HE10GUA01_244' 'ch2k_KI04MCV01_366' 'ch2k_KI14PAR01_516' 'ch2k_KI14PAR01_518' 'ch2k_KU00NIN01_150' 'ch2k_KU99HOU01_40' 'ch2k_LI06FIJ01_582' 'ch2k_LI06RAR01_12' 'ch2k_LI06RAR02_270' 'ch2k_LI94SEC01_436' 'ch2k_LI99CLI01_486' 'ch2k_MO06PED01_226' 'ch2k_NA09MAL01_84' 'ch2k_NU11PAL01_52' 
'ch2k_OS14UCP01_236' 'ch2k_PF04PBA01_204' 'ch2k_QU06RAB01_144' 'ch2k_QU96ESV01_422' 'ch2k_RE18CAY01_30' 'ch2k_RO19YUC01_340' 'ch2k_SW98STP01_86' 'ch2k_TU01DEP01_450' 'ch2k_TU95MAD01_24' 'ch2k_UR00MAI01_22' 'ch2k_WU13TON01_506' 'ch2k_XI17HAI01_128' 'ch2k_XI17HAI01_134' 'ch2k_XI17HAI01_136' 'ch2k_ZI04IFR01_26' 'ch2k_ZI14IFR02_524' 'ch2k_ZI14TUR01_480' 'ch2k_ZI14TUR01_482' 'ch2k_ZI15BUN01_490' 'ch2k_ZI15CLE01_440' 'ch2k_ZI15IMP01_330' 'ch2k_ZI15IMP02_202' 'ch2k_ZI15MER01_4' 'ch2k_ZI15TAN01_280' 'iso2k_1286' 'iso2k_1554' 'iso2k_1556' 'iso2k_1704' 'iso2k_1817' 'iso2k_1851' 'iso2k_1855' 'iso2k_213' 'iso2k_298' 'iso2k_299' 'iso2k_404' 'iso2k_505' 'iso2k_549' 'iso2k_550' 'iso2k_579' 'iso2k_58' 'iso2k_702' 'iso2k_775' 'iso2k_788' 'iso2k_806' 'iso2k_811' 'iso2k_98' 'pages2k_0' 'pages2k_1004' 'pages2k_1026' 'pages2k_1048' 'pages2k_107' 'pages2k_1089' 'pages2k_1108' 'pages2k_1116' 'pages2k_1147' 'pages2k_1156' 'pages2k_1160' 'pages2k_1209' 'pages2k_122' 'pages2k_1252' 'pages2k_1274' 'pages2k_1293' 'pages2k_132' 'pages2k_1325' 'pages2k_1360' 'pages2k_1365' 'pages2k_1370' 'pages2k_1420' 'pages2k_1444' 'pages2k_1488' 'pages2k_1490' 'pages2k_1491' 'pages2k_1497' 'pages2k_1519' 'pages2k_1520' 'pages2k_1547' 'pages2k_1566' 'pages2k_158' 'pages2k_1605' 'pages2k_1619' 'pages2k_1636' 'pages2k_1692' 'pages2k_1703' 'pages2k_171' 'pages2k_1712' 'pages2k_1720' 'pages2k_1741' 'pages2k_1750' 'pages2k_1771' 'pages2k_1804' 'pages2k_1859' 'pages2k_1861' 'pages2k_1880' 'pages2k_1891' 'pages2k_1918' 'pages2k_1923' 'pages2k_1932' 'pages2k_1942' 'pages2k_1973' 'pages2k_1976' 'pages2k_1978' 'pages2k_1985' 'pages2k_1991' 'pages2k_1994' 'pages2k_2013' 'pages2k_203' 'pages2k_2042' 'pages2k_2059' 'pages2k_2085' 'pages2k_2094' 'pages2k_2110' 'pages2k_2150' 'pages2k_2156' 'pages2k_2214' 'pages2k_2220' 'pages2k_225' 'pages2k_2265' 'pages2k_2290' 'pages2k_2300' 'pages2k_2309' 'pages2k_2311' 'pages2k_2319' 'pages2k_2339' 'pages2k_2361' 'pages2k_238' 'pages2k_2402' 'pages2k_242' 'pages2k_2430' 'pages2k_2473' 
'pages2k_2500' 'pages2k_2510' 'pages2k_2514' 'pages2k_2517' 'pages2k_2534' 'pages2k_2538' 'pages2k_2561' 'pages2k_258' 'pages2k_2592' 'pages2k_2595' 'pages2k_2604' 'pages2k_2606' 'pages2k_2609' 'pages2k_2612' 'pages2k_2613' 'pages2k_2617' 'pages2k_263' 'pages2k_2634' 'pages2k_2660' 'pages2k_267' 'pages2k_2677' 'pages2k_2703' 'pages2k_271' 'pages2k_2722' 'pages2k_273' 'pages2k_2750' 'pages2k_2755' 'pages2k_2759' 'pages2k_2795' 'pages2k_2798' 'pages2k_281' 'pages2k_2830' 'pages2k_2843' 'pages2k_2899' 'pages2k_2906' 'pages2k_2922' 'pages2k_294' 'pages2k_2953' 'pages2k_2959' 'pages2k_2976' 'pages2k_3002' 'pages2k_3030' 'pages2k_3033' 'pages2k_3038' 'pages2k_3064' 'pages2k_3068' 'pages2k_307' 'pages2k_3085' 'pages2k_309' 'pages2k_3107' 'pages2k_3108' 'pages2k_3132' 'pages2k_3134' 'pages2k_315' 'pages2k_317' 'pages2k_3170' 'pages2k_3179' 'pages2k_3191' 'pages2k_3202' 'pages2k_323' 'pages2k_3236' 'pages2k_3239' 'pages2k_3243' 'pages2k_3263' 'pages2k_3266' 'pages2k_3307' 'pages2k_3313' 'pages2k_3342' 'pages2k_3352' 'pages2k_3372' 'pages2k_3374' 'pages2k_3404' 'pages2k_3419' 'pages2k_3503' 'pages2k_3524' 'pages2k_3550' 'pages2k_3552' 'pages2k_3554' 'pages2k_3571' 'pages2k_3583' 'pages2k_3599' 'pages2k_3609' 'pages2k_3631' 'pages2k_3642' 'pages2k_385' 'pages2k_387' 'pages2k_395' 'pages2k_397' 'pages2k_409' 'pages2k_418' 'pages2k_421' 'pages2k_433' 'pages2k_445' 'pages2k_446' 'pages2k_462' 'pages2k_468' 'pages2k_474' 'pages2k_477' 'pages2k_478' 'pages2k_486' 'pages2k_495' 'pages2k_50' 'pages2k_500' 'pages2k_565' 'pages2k_583' 'pages2k_592' 'pages2k_6' 'pages2k_610' 'pages2k_626' 'pages2k_691' 'pages2k_730' 'pages2k_736' 'pages2k_800' 'pages2k_81' 'pages2k_818' 'pages2k_83' 'pages2k_830' 'pages2k_831' 'pages2k_842' 'pages2k_857' 'pages2k_88' 'pages2k_881' 'pages2k_895' 'pages2k_900' 'pages2k_94' 'pages2k_940' 'pages2k_945' 'pages2k_960' 'pages2k_976' 'pages2k_982' 'sisal_113.0_66' 'sisal_201.0_133' 'sisal_205.0_141' 'sisal_253.0_171' 'sisal_271.0_174' 'sisal_272.0_177' 
'sisal_273.0_179' 'sisal_278.0_184' 'sisal_294.0_194' 'sisal_305.0_199' 'sisal_329.0_213' 'sisal_330.0_215' 'sisal_446.0_292' 'sisal_47.0_21' 'sisal_47.0_22' 'sisal_47.0_23' 'sisal_471.0_314' 'sisal_896.0_531' 'sisal_896.0_533'] all_merged
# len(df_dupfree_rmv)
# add columns on decision process to df_dupfree:
df_dupfree_rmv['duplicateDetails'] = 'N/A'
for ID in dup_details:
    if ID in df_dupfree_rmv.index:
        if df_dupfree_rmv.at[ID, 'duplicateDetails'] == 'N/A':
            df_dupfree_rmv.at[ID, 'duplicateDetails'] = dup_details[ID]
        else:
            df_dupfree_rmv.at[ID, 'duplicateDetails'] += dup_details[ID]
# df_dupfree_rmv[df_dupfree_rmv[ 'duplicateDetails']!='N/A'].at['ch2k_DE14DTO03_140', 'duplicateDetails']
2. Records to be COMPOSITED¶
Now identify all the records to which the decision 'COMPOSITE' applies, create composites and store them in df_composite.
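The actual compositing is done by dup.join_composites_metadata below. As a minimal sketch of the compositing idea described in the notebook header (create z-scores and average over the shared time period), assuming sorted year vectors and using the hypothetical helper name composite_zscores:

```python
import numpy as np

def composite_zscores(year_a, vals_a, year_b, vals_b):
    """Illustrative sketch only (not the dup module's implementation).
    Standardise each record to z-scores, average the z-scores on years
    where both records are defined, and keep the single available
    z-score elsewhere. Year arrays are assumed sorted ascending."""
    years = np.union1d(year_a, year_b)
    za = np.full(years.shape, np.nan)
    zb = np.full(years.shape, np.nan)
    za[np.isin(years, year_a)] = (vals_a - vals_a.mean()) / vals_a.std()
    zb[np.isin(years, year_b)] = (vals_b - vals_b.mean()) / vals_b.std()
    # nanmean falls back to the single defined value outside the overlap
    return years, np.nanmean(np.vstack([za, zb]), axis=0)
```

Note that averaging z-scores discards the original units, which is why the composited records carry a distinct `dod2k_composite_z_...` datasetId rather than pretending to be either parent record.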
# df.index
# add the column 'duplicateDetails' to df, in case it does not exist
if 'duplicateDetails' not in df.columns: df['duplicateDetails']='N/A'
# load the records to be composited
comp_ID_pairs = df_decisions[(df_decisions['Decision 1']=='COMPOSITE')&(df_decisions['Decision 2']=='COMPOSITE')]
# create new composite data and metadata from the pairs
# loop through the composite pairs and check metadata
df_composite = dup.join_composites_metadata(df, comp_ID_pairs, df_decisions, header)
FE23_northamerica_usa_me017 FE23_northamerica_usa_me018 -------------------------------------------------------------------------------- Metadata different for >>>geo_siteName<<< in: FE23_northamerica_usa_me017 (IronboundIsland) and FE23_northamerica_usa_me018 (IronboundIslandLongCores). -------------------------------------------------------------------------------- Metadata different for >>>interpretation_variable<<< in: FE23_northamerica_usa_me017 (moisture) and FE23_northamerica_usa_me018 (temperature+moisture).
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//all_merged/dup_detection//composite_FE23_northamerica_usa_me017_FE23_northamerica_usa_me018.pdf FE23_northamerica_usa_mt112 FE23_northamerica_usa_mt113 -------------------------------------------------------------------------------- Metadata different for >>>geo_siteName<<< in: FE23_northamerica_usa_mt112 (YellowMountainRidge1) and FE23_northamerica_usa_mt113 (YellowMountainRidge1-EntireBarkTrees).
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//all_merged/dup_detection//composite_FE23_northamerica_usa_mt112_FE23_northamerica_usa_mt113.pdf FE23_northamerica_usa_nj001 FE23_northamerica_usa_nj002 -------------------------------------------------------------------------------- Metadata different for >>>geo_siteName<<< in: FE23_northamerica_usa_nj001 (HutchensonForestwithLongCores) and FE23_northamerica_usa_nj002 (HutchensonForest).
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//all_merged/dup_detection//composite_FE23_northamerica_usa_nj001_FE23_northamerica_usa_nj002.pdf
print(df_composite.info())
# raise Exception
<class 'pandas.core.frame.DataFrame'> RangeIndex: 3 entries, 0 to 2 Data columns (total 20 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 3 non-null object 1 geo_meanElev 3 non-null float32 2 geo_meanLat 3 non-null float32 3 geo_meanLon 3 non-null float32 4 geo_siteName 3 non-null object 5 paleoData_proxy 3 non-null object 6 yearUnits 3 non-null object 7 interpretation_variable 3 non-null object 8 interpretation_direction 3 non-null object 9 interpretation_seasonality 3 non-null object 10 paleoData_values 3 non-null object 11 year 3 non-null object 12 dataSetName 3 non-null object 13 originalDatabase 3 non-null object 14 originalDataURL 3 non-null object 15 paleoData_notes 3 non-null object 16 interpretation_variableDetail 3 non-null object 17 datasetId 3 non-null object 18 paleoData_units 3 non-null object 19 duplicateDetails 3 non-null object dtypes: float32(3), object(17) memory usage: 576.0+ bytes None
3. Check for overlap between duplicates¶
The duplicate-free dataframe is obtained by joining
- df_dupfree_rmv (duplicate free, as all records with decision REMOVE and/or COMPOSITE have been removed) and
- df_composite (duplicate free, as the duplicates have been composited).
There might still be duplicates between the two dataframes: this happens when a record has been associated with more than one duplicate candidate pair.
The scenarios for duplicates appearing twice:
1. REMOVE/KEEP and COMPOSITE/COMPOSITE:
- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> COMPOSITE, c -> COMPOSITE.
In this case, b and ac (the composite record of a and c) would be duplicates in the merged dataframe.
2a. REMOVE/KEEP and KEEP/REMOVE:
- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> KEEP, c -> REMOVE.
In this case a would still be removed, as REMOVE overrides KEEP in the algorithm. So only b will be kept and no duplicates would remain.
2b. REMOVE/KEEP and REMOVE/KEEP:
- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> REMOVE, c -> KEEP.
In this case, a would be removed, but b and c would be kept and would be duplicates in the merged dataframe.
3. COMPOSITE x 2:
- duplicate pair a and b have had the decisions assigned: a -> COMPOSITE, b -> COMPOSITE
- duplicate pair a and c have had the decisions assigned: a -> COMPOSITE, c -> COMPOSITE.
In this case, ab and ac would be duplicates in the merged dataframe.
4. REMOVE/KEEP and KEEP/KEEP:
- duplicate pair a and b have had the decisions assigned: a -> REMOVE, b -> KEEP
- duplicate pair a and c have had the decisions assigned: a -> KEEP, c -> KEEP.
In this case a would be removed, b and c would be kept, but as c is not a duplicate of a, no duplicates would remain.
What about records which appear more than twice? These records would not be dealt with by the following approach, as it only removes duplicates which appear TWICE in the dataset.
The algorithm currently does NOT handle such multiple duplicate records in a single pass. To do so, the loop has to run UNTIL no duplicate has been assigned to more than one decision. Doable!
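The decision precedence described in the scenarios above can be sketched as a small resolver. This is an illustration only, not part of the dup module; resolve_decisions and PRECEDENCE are hypothetical names, and placing COMPOSITE between KEEP and REMOVE is an assumption (the text only states that REMOVE overrides KEEP):

```python
# Assumed precedence: REMOVE > COMPOSITE > KEEP. Only REMOVE > KEEP is
# stated in the text above; the COMPOSITE rank is an illustrative choice.
PRECEDENCE = {'KEEP': 0, 'COMPOSITE': 1, 'REMOVE': 2}

def resolve_decisions(pair_decisions):
    """Hypothetical helper: collapse per-pair decisions into one decision
    per record. pair_decisions is an iterable of (datasetId, decision)
    tuples, one entry per record per duplicate pair."""
    resolved = {}
    for rec_id, dec in pair_decisions:
        # keep the highest-precedence decision seen so far for this record
        if rec_id not in resolved or PRECEDENCE[dec] > PRECEDENCE[resolved[rec_id]]:
            resolved[rec_id] = dec
    return resolved

# Scenario 2a: a -> REMOVE in one pair, a -> KEEP in another
print(resolve_decisions([('a', 'REMOVE'), ('b', 'KEEP'),
                         ('a', 'KEEP'), ('c', 'REMOVE')]))
```

Under this sketch, record a resolves to REMOVE even though one pair voted KEEP, matching scenario 2a.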
Merge the composites and the dataframe freed from REMOVE and COMPOSITE type records
df_duprmv_cmp = pd.concat([df_dupfree_rmv, df_composite])
df_duprmv_cmp.index = df_duprmv_cmp['datasetId']
Create a loop which implements a mini duplicate detection on all the records which have multiple decisions associated.
# initiate the loop
tmp_df_duprmv_cmp = df_duprmv_cmp.copy()
tmp_decisions = decisions.copy()
for ii in range(10):
    tmp_df_duprmv_cmp.set_index('datasetId', inplace=True)
    tmp_df_duprmv_cmp['datasetId'] = tmp_df_duprmv_cmp.index
    print('-'*20)
    print(f'ITERATION # {ii}')
    # Collect all IDs which may be associated with multiple duplicates
    multiple_dups = []
    for id in tmp_decisions.keys():
        if len(tmp_decisions[id]) > 1:
            if id not in multiple_dups:
                multiple_dups.append(id)
    if len(multiple_dups) > 0:
        # Check which of the multiple-duplicate IDs are still in the dataframe
        multiple_dups_new = []
        current_ids = set(tmp_df_duprmv_cmp.index)  # all current IDs as a set
        for id in multiple_dups:
            if id in current_ids:  # simple membership check
                multiple_dups_new.append(id)
        if len(multiple_dups_new) > 0:
            print(f'WARNING! Decisions associated with {len(multiple_dups_new)} multiple duplicates in the new dataframe.')
            print('Please review these records below and run through a further duplicate detection workflow until no more duplicates are found.')
        else:
            print('No more multiple duplicates found in current dataframe.')
            print('SUCCESS!!')
            break
    else:
        print('No more multiple duplicates.')
        print('SUCCESS!!')
        break
    # Now we create a small dataframe which needs to be checked for duplicates.
    df_check = tmp_df_duprmv_cmp.copy()[np.isin(tmp_df_duprmv_cmp['datasetId'], multiple_dups_new)]
    print('Check dataframe: ')
    df_check.name = 'tmp'
    df_check.index = range(len(df_check))
    print(df_check.info())
    # Run a brief duplicate detection algorithm on the check dataframe.
    # Note that by default the composited data has the highest value in the hierarchy.
    pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
    if len(pot_dup_IDs) == 0:
        print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
        break
    yn = ''
    while yn not in ['y', 'n']:
        yn = input('Do you want to continue with the decision process for duplicates? [y/n]')
    if yn == 'n': break
    df_check = dup.define_hierarchy(df_check)
    dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True,
                                     remove_identicals=False, backup=False, comment=False)
    # implement the decisions
    tmp_df_decisions = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}.csv', header=5)
    tmp_dup_details = dup.provide_dup_details(tmp_df_decisions, header)
    # rebuild the decisions dictionary from the new decision file
    tmp_decisions = {}
    for ind in tmp_df_decisions.index:
        id1, id2 = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
        dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
        for id, dec in zip([id1, id2], [dec1, dec2]):
            if id not in tmp_decisions: tmp_decisions[id] = []
            tmp_decisions[id] += [dec]
    df_check.set_index('datasetId', inplace=True)
    df_check['datasetId'] = df_check.index
    # drop all REMOVE or COMPOSITE type records
    tmp_remove_IDs = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'], ['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'], ['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs = np.unique(tmp_remove_IDs)
    tmp_df_dupfree_rmv = tmp_df_duprmv_cmp.drop(tmp_remove_IDs)  # df freed from 'REMOVE' type duplicates
    # composite the COMPOSITE/COMPOSITE pairs:
    # create new composite data and metadata from the pairs
    tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]
    tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)
    tmp_df_duprmv_cmp = pd.concat([tmp_df_dupfree_rmv, tmp_df_composite])
    print('--'*20)
    print('Finished iteration.')
    print('NEW DATAFRAME:')
    print(tmp_df_duprmv_cmp.info())
    print('--'*20)
    print('--'*20)
    print('--'*20)
if ii == 9: print('STILL DUPLICATES PRESENT AFTER MULTIPLE ITERATIONS! REVISE DECISION PROCESS!!')
-------------------- ITERATION # 0 WARNING! Decisions associated with 56 multiple duplicates in the new dataframe. Please review these records below and run through a further duplicate detection workflow until no more duplicates are found. Check dataframe: <class 'pandas.core.frame.DataFrame'> RangeIndex: 56 entries, 0 to 55 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 56 non-null object 1 dataSetName 56 non-null object 2 geo_meanElev 54 non-null float32 3 geo_meanLat 56 non-null float32 4 geo_meanLon 56 non-null float32 5 geo_siteName 56 non-null object 6 interpretation_direction 56 non-null object 7 interpretation_seasonality 56 non-null object 8 interpretation_variable 56 non-null object 9 interpretation_variableDetail 56 non-null object 10 originalDataURL 56 non-null object 11 originalDatabase 56 non-null object 12 paleoData_notes 56 non-null object 13 paleoData_proxy 56 non-null object 14 paleoData_sensorSpecies 56 non-null object 15 paleoData_units 56 non-null object 16 paleoData_values 56 non-null object 17 paleoData_variableName 56 non-null object 18 year 56 non-null object 19 yearUnits 56 non-null object 20 duplicateDetails 56 non-null object 21 datasetId 56 non-null object dtypes: float32(3), object(19) memory usage: 9.1+ KB None tmp Start duplicate search: ================================= checking parameters: proxy archive : must match proxy type : must match distance (km) < 8 elevation : must match time overlap > 10 correlation > 0.9 RMSE < 0.1 1st difference rmse < 0.1 correlation of 1st difference > 0.9 ================================= Start duplicate search Progress: 0/56 --> Found potential duplicate: 5: pages2k_2146&6: pages2k_2149 (n_potential_duplicates=1) Progress: 10/56 Progress: 20/56 Progress: 30/56 --> Found potential duplicate: 36: iso2k_1069&46: iso2k_1660 (n_potential_duplicates=2) Progress: 40/56 Progress: 50/56 ============================================================ 
Saved indices, IDs, distances, correlations in data/tmp/dup_detection/ ============================================================ Detected 2 possible duplicates in tmp. ============================================================ ============================================================ Indices: [5, 6], [36, 46] IDs: pages2k_2146 + pages2k_2149, iso2k_1069 + iso2k_1660 ============================================================
No back up. ------------------------------------------------------------ ------------------------------------------------------------ > 1/2,pages2k_2146,pages2k_2149,0.0,0.9676859064640569 ==================================================================== === POTENTIAL DUPLICATE 0/2: pages2k_2146+pages2k_2149 === === URL 1: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/SAm-CentralAndes6.Villalba.2014.txt === === URL 2: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/SAm-CentralAndes6.Villalba.2014.txt === True if pot_dup_corrs[i_pot_dups]>=0.98 else False False (len(time_1)==len(time_2)) True metadata_identical: True lat True lon True elevation True archivetype True paleodata_proxy True sites_identical: True URL_identical: True data_identical: False correlation_perfect: False
**Decision required for this duplicate pair (see figure above).**
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/000_pages2k_2146_pages2k_2149__5_6.jpg KEEP BOTH: keep pages2k_2146, keep pages2k_2149. write decision to backup file > 2/2,iso2k_1069,iso2k_1660,0.0,0.9865701575318525 ==================================================================== === POTENTIAL DUPLICATE 1/2: iso2k_1069+iso2k_1660 === === URL 1: https://doi.pangaea.de/10.1594/PANGAEA.824732 === === URL 2: https://doi.pangaea.de/10.1594/PANGAEA.871279 === True if pot_dup_corrs[i_pot_dups]>=0.98 else False True (len(time_1)==len(time_2)) False metadata_identical: True lat True lon True elevation True archivetype True paleodata_proxy True sites_identical: True URL_identical: False data_identical: False correlation_perfect: False
**Decision required for this duplicate pair (see figure above).**
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/001_iso2k_1069_iso2k_1660__36_46.jpg KEEP BOTH: keep iso2k_1069, keep iso2k_1660. write decision to backup file ===================================================================== END OF DUPLICATE DECISION PROCESS. =====================================================================
(2, 27) Saved the decisions under data/tmp/dup_detection/dup_decisions_tmp_LL_25-12-11.csv Summary of all decisions made: #0: KEEP record pages2k_2146. KEEP record pages2k_2149. #1: KEEP record iso2k_1069. KEEP record iso2k_1660. ---------------------------------------- Finished iteration. NEW DATAFRAME: <class 'pandas.core.frame.DataFrame'> Index: 4967 entries, pages2k_5 to dod2k_composite_z_FE23_northamerica_usa_nj001_FE23_northamerica_usa_nj002 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4967 non-null object 1 dataSetName 4967 non-null object 2 geo_meanElev 4884 non-null float32 3 geo_meanLat 4967 non-null float32 4 geo_meanLon 4967 non-null float32 5 geo_siteName 4967 non-null object 6 interpretation_direction 4967 non-null object 7 interpretation_seasonality 4967 non-null object 8 interpretation_variable 4967 non-null object 9 interpretation_variableDetail 4967 non-null object 10 originalDataURL 4967 non-null object 11 originalDatabase 4967 non-null object 12 paleoData_notes 4967 non-null object 13 paleoData_proxy 4967 non-null object 14 paleoData_sensorSpecies 4964 non-null object 15 paleoData_units 4967 non-null object 16 paleoData_values 4967 non-null object 17 paleoData_variableName 4964 non-null object 18 year 4967 non-null object 19 yearUnits 4967 non-null object 20 duplicateDetails 4967 non-null object 21 datasetId 4967 non-null object dtypes: float32(3), object(19) memory usage: 834.3+ KB None ---------------------------------------- ---------------------------------------- ---------------------------------------- -------------------- ITERATION # 1 No more multiple duplicates. SUCCESS!!
4. Create duplicate free dataframe¶
df_dupfree_prelim = tmp_df_duprmv_cmp
print(df_dupfree_prelim.info())
<class 'pandas.core.frame.DataFrame'> Index: 4967 entries, pages2k_5 to dod2k_composite_z_FE23_northamerica_usa_nj001_FE23_northamerica_usa_nj002 Data columns (total 22 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 archiveType 4967 non-null object 1 dataSetName 4967 non-null object 2 geo_meanElev 4884 non-null float32 3 geo_meanLat 4967 non-null float32 4 geo_meanLon 4967 non-null float32 5 geo_siteName 4967 non-null object 6 interpretation_direction 4967 non-null object 7 interpretation_seasonality 4967 non-null object 8 interpretation_variable 4967 non-null object 9 interpretation_variableDetail 4967 non-null object 10 originalDataURL 4967 non-null object 11 originalDatabase 4967 non-null object 12 paleoData_notes 4967 non-null object 13 paleoData_proxy 4967 non-null object 14 paleoData_sensorSpecies 4964 non-null object 15 paleoData_units 4967 non-null object 16 paleoData_values 4967 non-null object 17 paleoData_variableName 4964 non-null object 18 year 4967 non-null object 19 yearUnits 4967 non-null object 20 duplicateDetails 4967 non-null object 21 datasetId 4967 non-null object dtypes: float32(3), object(19) memory usage: 834.3+ KB None
Run another duplicate search on the whole dataframe to make sure no duplicates remain.
tmp_df_duprmv_cmp = df_dupfree_prelim.copy()
tmp_df_duprmv_cmp.set_index('datasetId', inplace = True)
tmp_df_duprmv_cmp['datasetId']=tmp_df_duprmv_cmp.index
# Now we create a dataframe which needs to be checked for duplicates.
df_check = tmp_df_duprmv_cmp.copy()
df_check.name = 'tmp'
df_check.index = range(len(df_check))
# We then run a brief duplicate detection algorithm on the dataframe. Note that by default the composited data has the highest value in the hierarchy.
pot_dup_IDs = dup.find_duplicates_optimized(df_check, n_points_thresh=10, return_data=True)
if len(pot_dup_IDs) == 0:
    print('SUCCESS!! NO MORE DUPLICATES DETECTED!!')
else:
    df_check = dup.define_hierarchy(df_check)
    dup.duplicate_decisions_multiple(df_check, operator_details=operator_details, choose_recollection=True,
                                     remove_identicals=False, backup=False)
    # implement the decisions
    tmp_df_decisions = pd.read_csv(f'data/{df_check.name}/dup_detection/dup_decisions_{df_check.name}_{initials}_{date}.csv', header=5)
    tmp_dup_details = dup.provide_dup_details(tmp_df_decisions, header)
    # rebuild the decisions dictionary from the new decision file
    tmp_decisions = {}
    for ind in tmp_df_decisions.index:
        id1, id2 = tmp_df_decisions.loc[ind, ['datasetId 1', 'datasetId 2']]
        dec1, dec2 = tmp_df_decisions.loc[ind, ['Decision 1', 'Decision 2']]
        for id, dec in zip([id1, id2], [dec1, dec2]):
            if id not in tmp_decisions: tmp_decisions[id] = []
            tmp_decisions[id] += [dec]
    df_check.set_index('datasetId', inplace=True)
    df_check['datasetId'] = df_check.index
    # drop all REMOVE or COMPOSITE type records
    tmp_remove_IDs = list(tmp_df_decisions['datasetId 1'][np.isin(tmp_df_decisions['Decision 1'], ['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs += list(tmp_df_decisions['datasetId 2'][np.isin(tmp_df_decisions['Decision 2'], ['REMOVE', 'COMPOSITE'])])
    tmp_remove_IDs = np.unique(tmp_remove_IDs)
    tmp_df_dupfree_rmv = tmp_df_duprmv_cmp.drop(tmp_remove_IDs)  # df freed from 'REMOVE' type duplicates
    # composite the COMPOSITE/COMPOSITE pairs:
    # create new composite data and metadata from the pairs
    tmp_comp_ID_pairs = tmp_df_decisions[(tmp_df_decisions['Decision 1']=='COMPOSITE')&(tmp_df_decisions['Decision 2']=='COMPOSITE')]
    tmp_df_composite = dup.join_composites_metadata(df_check, tmp_comp_ID_pairs, tmp_df_decisions, header)
    tmp_df_duprmv_cmp = pd.concat([tmp_df_dupfree_rmv, tmp_df_composite])
print('Finished last round of duplicate removal.')
print('Potentially run through this cell again to check for remaining duplicates.')
tmp Start duplicate search: ================================= checking parameters: proxy archive : must match proxy type : must match distance (km) < 8 elevation : must match time overlap > 10 correlation > 0.9 RMSE < 0.1 1st difference rmse < 0.1 correlation of 1st difference > 0.9 ================================= Start duplicate search [Progress: 0/4967 ... 4110/4967 omitted]
--> Found potential duplicate: 17: pages2k_62&18: pages2k_63 (n_potential_duplicates=1)
--> Found potential duplicate: 516: pages2k_1686&517: pages2k_1688 (n_potential_duplicates=2)
--> Found potential duplicate: 638: pages2k_2098&640: pages2k_2103 (n_potential_duplicates=3)
--> Found potential duplicate: 656: pages2k_2146&658: pages2k_2149 (n_potential_duplicates=4)
--> Found potential duplicate: 3902: ch2k_de14dto03_140&3905: ch2k_de14dto01_148 (n_potential_duplicates=5)
--> Found potential duplicate: 3921: ch2k_he13mis01_194&4066: iso2k_211 (n_potential_duplicates=6)
--> Found potential duplicate: 3992: ch2k_fl18dto01_460&4011: ch2k_fl18dto02_554 (n_potential_duplicates=7)
/home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2897: RuntimeWarning: invalid value encountered in divide c /= stddev[:, None] /home/jupyter-mnevans/.conda/envs/cfr-env/lib/python3.11/site-packages/numpy/lib/function_base.py:2898: RuntimeWarning: invalid value encountered in divide c /= stddev[None, :]
Progress: 4120/4967 Progress: 4130/4967 Progress: 4140/4967 --> Found potential duplicate: 4146: iso2k_533&4502: sisal_115.0_69 (n_potential_duplicates=8) Progress: 4150/4967 Progress: 4160/4967 Progress: 4170/4967 Progress: 4180/4967 Progress: 4190/4967 Progress: 4200/4967 Progress: 4210/4967 Progress: 4220/4967 Progress: 4230/4967 Progress: 4240/4967 Progress: 4250/4967 --> Found potential duplicate: 4252: iso2k_1069&4368: iso2k_1660 (n_potential_duplicates=9) Progress: 4260/4967 Progress: 4270/4967 Progress: 4280/4967 Progress: 4290/4967 Progress: 4300/4967 Progress: 4310/4967 Progress: 4320/4967 Progress: 4330/4967 Progress: 4340/4967 Progress: 4350/4967 Progress: 4360/4967 Progress: 4370/4967 Progress: 4380/4967 Progress: 4390/4967 Progress: 4400/4967 Progress: 4410/4967 Progress: 4420/4967 Progress: 4430/4967 Progress: 4440/4967 Progress: 4450/4967 Progress: 4460/4967 Progress: 4470/4967 Progress: 4480/4967 Progress: 4490/4967 Progress: 4500/4967 Progress: 4510/4967 Progress: 4520/4967 Progress: 4530/4967 Progress: 4540/4967 Progress: 4550/4967 Progress: 4560/4967 Progress: 4570/4967 Progress: 4580/4967 Progress: 4590/4967 Progress: 4600/4967 Progress: 4610/4967 Progress: 4620/4967 Progress: 4630/4967 Progress: 4640/4967 Progress: 4650/4967 Progress: 4660/4967 Progress: 4670/4967 Progress: 4680/4967 Progress: 4690/4967 Progress: 4700/4967 Progress: 4710/4967 Progress: 4720/4967 Progress: 4730/4967 Progress: 4740/4967 Progress: 4750/4967 Progress: 4760/4967 Progress: 4770/4967 Progress: 4780/4967 Progress: 4790/4967 Progress: 4800/4967 Progress: 4810/4967 Progress: 4820/4967 Progress: 4830/4967 Progress: 4840/4967 Progress: 4850/4967 Progress: 4860/4967 Progress: 4870/4967 Progress: 4880/4967 Progress: 4890/4967 Progress: 4900/4967 Progress: 4910/4967 Progress: 4920/4967 Progress: 4930/4967 Progress: 4940/4967 Progress: 4950/4967 Progress: 4960/4967 ============================================================ Saved indices, IDs, distances, correlations in 
data/tmp/dup_detection/
============================================================
Detected 9 possible duplicates in tmp.
============================================================
============================================================
Indices: [17, 18], [516, 517], [638, 640], [656, 658], [3902, 3905], [3921, 4066], [3992, 4011], [4146, 4502], [4252, 4368]
IDs: pages2k_62 + pages2k_63, pages2k_1686 + pages2k_1688, pages2k_2098 + pages2k_2103, pages2k_2146 + pages2k_2149, ch2k_DE14DTO03_140 + ch2k_DE14DTO01_148, ch2k_HE13MIS01_194 + iso2k_211, ch2k_FL18DTO01_460 + ch2k_FL18DTO02_554, iso2k_533 + sisal_115.0_69, iso2k_1069 + iso2k_1660
============================================================
No back up.
------------------------------------------------------------
------------------------------------------------------------
> 1/9,pages2k_62,pages2k_63,0.0,0.9442037258051723
====================================================================
=== POTENTIAL DUPLICATE 0/9: pages2k_62+pages2k_63 ===
=== URL 1: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/NAm-SmithersSkiArea.Schweingruber.1996-2.txt ===
=== URL 2: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/NAm-SmithersSkiArea.Schweingruber.1996-2.txt ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  False
(len(time_1)==len(time_2))  True
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: True
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/000_pages2k_62_pages2k_63__17_18.jpg
KEEP BOTH: keep pages2k_62, keep pages2k_63.
write decision to backup file
> 2/9,pages2k_1686,pages2k_1688,0.0,1.0
====================================================================
=== POTENTIAL DUPLICATE 1/9: pages2k_1686+pages2k_1688 ===
=== URL 1: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/Ocn-ArabianSea.Doose-Rolinski.2001-1.txt ===
=== URL 2: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/Ocn-ArabianSea.Doose-Rolinski.2001-2.txt ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  True
(len(time_1)==len(time_2))  False
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: False
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/001_pages2k_1686_pages2k_1688__516_517.jpg
KEEP BOTH: keep pages2k_1686, keep pages2k_1688.
write decision to backup file
> 3/9,pages2k_2098,pages2k_2103,0.0,0.990390473253425
====================================================================
=== POTENTIAL DUPLICATE 2/9: pages2k_2098+pages2k_2103 ===
=== URL 1: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/Ocn-CariacoBasin.Lea.2003-1.txt ===
=== URL 2: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/Ocn-CariacoBasin.Lea.2003-2.txt ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  True
(len(time_1)==len(time_2))  False
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: False
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/002_pages2k_2098_pages2k_2103__638_640.jpg
KEEP BOTH: keep pages2k_2098, keep pages2k_2103.
write decision to backup file
> 4/9,pages2k_2146,pages2k_2149,0.0,0.9676859064640569
====================================================================
=== POTENTIAL DUPLICATE 3/9: pages2k_2146+pages2k_2149 ===
=== URL 1: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/SAm-CentralAndes6.Villalba.2014.txt ===
=== URL 2: https://www1.ncdc.noaa.gov/pub/data/paleo/pages2k/pages2k-temperature-v2-2017/data-version-2.0.0/SAm-CentralAndes6.Villalba.2014.txt ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  False
(len(time_1)==len(time_2))  True
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: True
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/003_pages2k_2146_pages2k_2149__656_658.jpg
KEEP BOTH: keep pages2k_2146, keep pages2k_2149.
write decision to backup file
> 5/9,ch2k_DE14DTO03_140,ch2k_DE14DTO01_148,0.0,0.9484736035921149
====================================================================
=== POTENTIAL DUPLICATE 4/9: ch2k_DE14DTO03_140+ch2k_DE14DTO01_148 ===
=== URL 1: https://www.ncdc.noaa.gov/paleo/study/11935 ===
=== URL 2: https://www.ncdc.noaa.gov/paleo/study/16217 ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  False
(len(time_1)==len(time_2))  False
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: False
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/004_ch2k_DE14DTO03_140_ch2k_DE14DTO01_148__3902_3905.jpg
KEEP BOTH: keep ch2k_DE14DTO03_140, keep ch2k_DE14DTO01_148.
write decision to backup file
> 6/9,ch2k_HE13MIS01_194,iso2k_211,0.18098363234620218,0.9277960229464544
====================================================================
=== POTENTIAL DUPLICATE 5/9: ch2k_HE13MIS01_194+iso2k_211 ===
=== URL 1: https://www.ncdc.noaa.gov/paleo/study/15794 ===
=== URL 2: https://www.ncdc.noaa.gov/paleo/study/15794 ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  False
(len(time_1)==len(time_2))  False
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: True
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/005_ch2k_HE13MIS01_194_iso2k_211__3921_4066.jpg
KEEP BOTH: keep ch2k_HE13MIS01_194, keep iso2k_211.
write decision to backup file
> 7/9,ch2k_FL18DTO01_460,ch2k_FL18DTO02_554,0.34539124455148484,0.8307245439795362
====================================================================
=== POTENTIAL DUPLICATE 6/9: ch2k_FL18DTO01_460+ch2k_FL18DTO02_554 ===
=== URL 1: https://www.ncdc.noaa.gov/paleo/study/34553 ===
=== URL 2: https://www.ncdc.noaa.gov/paleo/study/34553 ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  False
(len(time_1)==len(time_2))  False
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: True
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/006_ch2k_FL18DTO01_460_ch2k_FL18DTO02_554__3992_4011.jpg
KEEP BOTH: keep ch2k_FL18DTO01_460, keep ch2k_FL18DTO02_554.
write decision to backup file
> 8/9,iso2k_533,sisal_115.0_69,0.4944092732924204,0.9999999999999998
====================================================================
=== POTENTIAL DUPLICATE 7/9: iso2k_533+sisal_115.0_69 ===
=== URL 1: https://www.ncdc.noaa.gov/paleo-search/study/5427 ===
=== URL 2: ['10.1126/science.1091220'] ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  True
(len(time_1)==len(time_2))  False
metadata_identical: False
 lat True
 lon True
 elevation True
 archivetype False
 paleodata_proxy True
sites_identical: False
URL_identical: False
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/007_iso2k_533_sisal_115.0_69__4146_4502.jpg
KEEP BOTH: keep iso2k_533, keep sisal_115.0_69.
write decision to backup file
> 9/9,iso2k_1069,iso2k_1660,0.0,0.9865701575318525
====================================================================
=== POTENTIAL DUPLICATE 8/9: iso2k_1069+iso2k_1660 ===
=== URL 1: https://doi.pangaea.de/10.1594/PANGAEA.824732 ===
=== URL 2: https://doi.pangaea.de/10.1594/PANGAEA.871279 ===
True if pot_dup_corrs[i_pot_dups]>=0.98 else False  True
(len(time_1)==len(time_2))  False
metadata_identical: True
 lat True
 lon True
 elevation True
 archivetype True
 paleodata_proxy True
sites_identical: True
URL_identical: False
data_identical: False
correlation_perfect: False
**Decision required for this duplicate pair (see figure above).** Before inputting your decision, would you like to leave a comment on your decision process?
saved figure in /home/jupyter-lluecke/dod2k_v2.0/dod2k/figs//dup_detection/tmp/008_iso2k_1069_iso2k_1660__4252_4368.jpg
KEEP BOTH: keep iso2k_1069, keep iso2k_1660.
write decision to backup file
=====================================================================
END OF DUPLICATE DECISION PROCESS.
=====================================================================
(9, 27)
Saved the decisions under data/tmp/dup_detection/dup_decisions_tmp_LL_25-12-11.csv
Summary of all decisions made:
#0: KEEP record pages2k_62. KEEP record pages2k_63.
#1: KEEP record pages2k_1686. KEEP record pages2k_1688.
#2: KEEP record pages2k_2098. KEEP record pages2k_2103.
#3: KEEP record pages2k_2146. KEEP record pages2k_2149.
#4: KEEP record ch2k_DE14DTO03_140. KEEP record ch2k_DE14DTO01_148.
#5: KEEP record ch2k_HE13MIS01_194. KEEP record iso2k_211.
#6: KEEP record ch2k_FL18DTO01_460. KEEP record ch2k_FL18DTO02_554.
#7: KEEP record iso2k_533. KEEP record sisal_115.0_69.
#8: KEEP record iso2k_1069. KEEP record iso2k_1660.
Finished last round of duplicate removal. Potentially run through this cell again to check for remaining duplicates.
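The per-pair report printed during the decision process combines several boolean checks: a 0.98 correlation threshold, matching record lengths, and identical values. A minimal sketch of those checks, assuming the correlation has already been computed; the function name `pair_checks` and the sample data are illustrative, not the notebook's actual implementation:

```python
import numpy as np

def pair_checks(values_1, values_2, corr):
    """Illustrative version of the per-pair checks logged during the decision process."""
    same_length = len(values_1) == len(values_2)
    # identical data only makes sense for equal-length records
    data_identical = same_length and np.array_equal(values_1, values_2)
    return {
        "corr_above_threshold": corr >= 0.98,  # threshold used in the printed test
        "same_length": same_length,
        "data_identical": data_identical,
        "correlation_perfect": bool(np.isclose(corr, 1.0)),
    }

# toy records: highly correlated but not identical
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([1.1, 2.0, 2.9, 4.2])
checks = pair_checks(a, b, corr=float(np.corrcoef(a, b)[0, 1]))
```

A pair passing every check would be a clear removal candidate; mixed results like the ones above are why an operator decision is required.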
df_dupfree = tmp_df_duprmv_cmp  # the composited, duplicate-removed dataframe from the removal step
print(df_dupfree.info())
<class 'pandas.core.frame.DataFrame'>
Index: 4967 entries, pages2k_5 to dod2k_composite_z_FE23_northamerica_usa_nj001_FE23_northamerica_usa_nj002
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    4967 non-null   object
 1   dataSetName                    4967 non-null   object
 2   geo_meanElev                   4884 non-null   float32
 3   geo_meanLat                    4967 non-null   float32
 4   geo_meanLon                    4967 non-null   float32
 5   geo_siteName                   4967 non-null   object
 6   interpretation_direction       4967 non-null   object
 7   interpretation_seasonality     4967 non-null   object
 8   interpretation_variable        4967 non-null   object
 9   interpretation_variableDetail  4967 non-null   object
 10  originalDataURL                4967 non-null   object
 11  originalDatabase               4967 non-null   object
 12  paleoData_notes                4967 non-null   object
 13  paleoData_proxy                4967 non-null   object
 14  paleoData_sensorSpecies        4964 non-null   object
 15  paleoData_units                4967 non-null   object
 16  paleoData_values               4967 non-null   object
 17  paleoData_variableName         4964 non-null   object
 18  year                           4967 non-null   object
 19  yearUnits                      4967 non-null   object
 20  duplicateDetails               4967 non-null   object
 21  datasetId                      4967 non-null   object
dtypes: float32(3), object(19)
memory usage: 834.3+ KB
None
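One of the operator decisions creates composites by z-scoring each duplicate record and averaging over their shared time period (see the notebook description). A minimal sketch of that idea with made-up series; the actual compositing code also merges metadata and records provenance in `duplicateDetails`:

```python
import pandas as pd

# two hypothetical annual records with a partial overlap (values made up)
s1 = pd.Series([10.0, 11.0, 12.0, 13.0], index=[1900, 1901, 1902, 1903])
s2 = pd.Series([0.5, 0.7, 0.9], index=[1901, 1902, 1903])

def zscore(s):
    # standardise a record to zero mean and unit (population) standard deviation
    return (s - s.mean()) / s.std(ddof=0)

# z-score each record, then average over the years both records share
composite = pd.concat([zscore(s1), zscore(s2)], axis=1, join="inner").mean(axis=1)
```

Z-scoring first puts records with different units (here, degrees vs. per-mil-like values) on a common scale before averaging.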
Save duplicate free dataframe¶
Sort the columns and assign a name to the dataframe; the name is used for saving (it determines the output directory and filenames). Make sure that the date and operator initials are included in the name.
df_dupfree = df_dupfree[sorted(df_dupfree.columns)]
df_dupfree.name = f'{df.name}_{initials}_{date}_dupfree'
os.makedirs(f'data/{df_dupfree.name}/', exist_ok=True)
df_dupfree.info()
print(df_dupfree.name)
<class 'pandas.core.frame.DataFrame'>
Index: 4967 entries, pages2k_5 to dod2k_composite_z_FE23_northamerica_usa_nj001_FE23_northamerica_usa_nj002
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    4967 non-null   object
 1   dataSetName                    4967 non-null   object
 2   datasetId                      4967 non-null   object
 3   duplicateDetails               4967 non-null   object
 4   geo_meanElev                   4884 non-null   float32
 5   geo_meanLat                    4967 non-null   float32
 6   geo_meanLon                    4967 non-null   float32
 7   geo_siteName                   4967 non-null   object
 8   interpretation_direction       4967 non-null   object
 9   interpretation_seasonality     4967 non-null   object
 10  interpretation_variable       4967 non-null   object
 11  interpretation_variableDetail  4967 non-null   object
 12  originalDataURL                4967 non-null   object
 13  originalDatabase               4967 non-null   object
 14  paleoData_notes                4967 non-null   object
 15  paleoData_proxy                4967 non-null   object
 16  paleoData_sensorSpecies        4964 non-null   object
 17  paleoData_units                4967 non-null   object
 18  paleoData_values               4967 non-null   object
 19  paleoData_variableName         4964 non-null   object
 20  year                           4967 non-null   object
 21  yearUnits                      4967 non-null   object
dtypes: float32(3), object(19)
memory usage: 834.3+ KB
all_merged_LL_25-12-11_dupfree
save pickle¶
# save the duplicate-free dataframe as a pickle
df_dupfree.to_pickle(f'data/{df_dupfree.name}/{df_dupfree.name}_compact.pkl')
save csv¶
# save to a list of csv files (metadata, data, year)
utf.write_compact_dataframe_to_csv(df_dupfree)
METADATA: datasetId, archiveType, dataSetName, duplicateDetails, geo_meanElev, geo_meanLat, geo_meanLon, geo_siteName, interpretation_direction, interpretation_seasonality, interpretation_variable, interpretation_variableDetail, originalDataURL, originalDatabase, paleoData_notes, paleoData_proxy, paleoData_sensorSpecies, paleoData_units, paleoData_variableName, yearUnits
Saved to /home/jupyter-lluecke/dod2k_v2.0/dod2k/data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact_%s.csv
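`utf.write_compact_dataframe_to_csv` stores the dataframe as separate metadata, year, and paleoData_values CSVs. A rough sketch of such a split, assuming each row stores its time axis and values as list-like objects; `write_compact_sketch` and its file layout are illustrative, not the utility's actual implementation:

```python
import os
import pandas as pd

def write_compact_sketch(df, name, outdir="data"):
    """Illustrative split of a record-per-row dataframe into metadata/year/values CSVs."""
    os.makedirs(f"{outdir}/{name}", exist_ok=True)
    series_cols = ["year", "paleoData_values"]
    meta_cols = [c for c in df.columns if c not in series_cols]
    # metadata: one row per record, scalar columns only
    df[meta_cols].to_csv(f"{outdir}/{name}/{name}_compact_metadata.csv")
    # time axes and values: arrays expanded to one column per sample, NaN-padded
    for col in series_cols:
        pd.DataFrame(df[col].tolist(), index=df.index).to_csv(
            f"{outdir}/{name}/{name}_compact_{col}.csv")
```

Splitting scalar metadata from the variable-length series is what makes the CSVs rectangular and easy to reload.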
# reload the dataframe from csv to verify the round-trip
print(utf.load_compact_dataframe_from_csv(df_dupfree.name).info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4967 entries, 0 to 4966
Data columns (total 22 columns):
 #   Column                         Non-Null Count  Dtype
---  ------                         --------------  -----
 0   archiveType                    4967 non-null   object
 1   dataSetName                    4967 non-null   object
 2   datasetId                      4967 non-null   object
 3   duplicateDetails               4967 non-null   object
 4   geo_meanElev                   4884 non-null   float32
 5   geo_meanLat                    4967 non-null   float32
 6   geo_meanLon                    4967 non-null   float32
 7   geo_siteName                   4967 non-null   object
 8   interpretation_direction       4967 non-null   object
 9   interpretation_seasonality     4967 non-null   object
 10  interpretation_variable       4967 non-null   object
 11  interpretation_variableDetail  4967 non-null   object
 12  originalDataURL                4967 non-null   object
 13  originalDatabase               4967 non-null   object
 14  paleoData_notes                4967 non-null   object
 15  paleoData_proxy                4967 non-null   object
 16  paleoData_sensorSpecies        4967 non-null   object
 17  paleoData_units                4967 non-null   object
 18  paleoData_values               4967 non-null   object
 19  paleoData_variableName         4967 non-null   object
 20  year                           4967 non-null   object
 21  yearUnits                      4967 non-null   object
dtypes: float32(3), object(19)
memory usage: 795.6+ KB
None
# write header with operator information as a README txt file
with open(f'data/{df_dupfree.name}/{df_dupfree.name}_dupfree_README.txt', 'w') as file:
    for line in header:
        file.write(line + '\n')
fn = utf.find(df_dupfree.name, f'data/{df_dupfree.name}')
print(fn)
if fn != []:
    print('----------------------------------------------------')
    print('Successfully finished the duplicate finalising process!'.upper())
    print('----------------------------------------------------')
    print('Saved the final output files in:')
    print()
    for ff in fn:
        print(' '+os.getcwd()+'/%s.'%ff)
    print()
    print('The duplicate detection process is now finished and the duplicate free database is ready to use!')
else:
    print('!!!!!!!!!!!!WARNING!!!!!!!!!!!')
    print(f'Final output file is missing at data/{df_dupfree.name}.')
    print()
    print('Please re-run the notebook to complete the duplicate finalising process.')
['data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact_year.csv', 'data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact_metadata.csv', 'data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_dupfree_README.txt', 'data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact_paleoData_values.csv', 'data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact.pkl']
----------------------------------------------------
SUCCESSFULLY FINISHED THE DUPLICATE FINALISING PROCESS!
----------------------------------------------------
Saved the final output files in:

 /home/jupyter-lluecke/dod2k_v2.0/dod2k/data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact_year.csv.
 /home/jupyter-lluecke/dod2k_v2.0/dod2k/data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact_metadata.csv.
 /home/jupyter-lluecke/dod2k_v2.0/dod2k/data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_dupfree_README.txt.
 /home/jupyter-lluecke/dod2k_v2.0/dod2k/data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact_paleoData_values.csv.
 /home/jupyter-lluecke/dod2k_v2.0/dod2k/data/all_merged_LL_25-12-11_dupfree/all_merged_LL_25-12-11_dupfree_compact.pkl.

The duplicate detection process is now finished and the duplicate free database is ready to use!
Summary and summary plots of datasets¶
Import plotting libraries
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec as GS
import cartopy.crs as ccrs
import cartopy.feature as cfeature
from dod2k_utilities import ut_plot as uplt # contains plotting functions
#%% summarise the removed records by original database
db_types = df_dupfree_rmv['originalDatabase'].unique()
col = uplt.get_colours(range(len(db_types)), 'tab10', 0, len(db_types))
#col = ['tab:blue','tab:green', 'tab:grey', 'tab:pink', 'tab:orange']
counts = []
ticks = []
colours = []
for ii, db in enumerate(db_types):
    cc = df_dupfree_rmv['originalDatabase'][(df_dupfree_rmv['originalDatabase']==db)].count()
    counts += [cc]
    ticks += [db.split('(Ocn_103')[0]]  # strip any stray '(Ocn_103' suffix from the label
    colours += [col[ii]]
# plot a bar chart of the number of proxy types included in the dataset
fig = plt.figure(figsize=(8,4), dpi=200)
ax = plt.gca()
plt.bar(range(len(ticks)), counts, color=colours)
plt.xlabel('database')
plt.ylabel('count')
ax.set_xticks(range(len(ticks)), ticks, rotation=45, ha='right')
#ax.set_xticklabels(proxy_types, rotation=45, ha='right')
plt.title('original database')
fig.tight_layout()
plt.show()
utf.figsave(fig, 'SF_removed_recs_barchart_databases', add='%s/'%df_dupfree.name)
saved figure in /figs/all_merged_LL_25-12-11_dupfree//SF_removed_recs_barchart_databases.pdf
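The per-database counting loop above can be cross-checked against pandas' built-in `value_counts`, which yields the same totals in one call; the series below is a toy stand-in for `df_dupfree_rmv['originalDatabase']`:

```python
import pandas as pd

# toy stand-in for the originalDatabase column
databases = pd.Series(["pages2k", "iso2k", "pages2k", "sisal", "iso2k", "pages2k"])
counts = databases.value_counts()
```

Note that `value_counts` sorts by descending count, so the bar ordering can differ from the loop's first-appearance order.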
#%% summarise the removed records by archive and proxy type
proxy_types = df_dupfree_rmv['paleoData_proxy'].unique()
archive_types = df_dupfree_rmv['archiveType'].unique()
print(proxy_types)
print(archive_types)
col = uplt.get_colours(range(0,len(archive_types)), 'Accent', -1, len(archive_types))
counts = []
ticks = []
colours = []
for ii, at in enumerate(archive_types):
    proxy_types = df_dupfree_rmv['paleoData_proxy'][df_dupfree_rmv['archiveType']==at].unique()
    for pt in proxy_types:
        cc = df_dupfree_rmv['paleoData_proxy'][(df_dupfree_rmv['paleoData_proxy']==pt)&(df_dupfree_rmv['archiveType']==at)].count()
        # print('%25s'%pt+': '+str(cc))
        counts += [cc]
        ticks += [at+': '+pt]
        colours += [col[ii]]
['ring width' 'residual chronology' 'ARSTAN' 'maximum latewood density' 'reflectance' 'd18O' 'd13C' 'Sr/Ca' 'Mg/Ca' 'temperature' 'historical' 'varve thickness' 'ice melt' 'alkenone' 'chironomid' 'Uk37' 'borehole' 'pollen' 'dinocyst' 'count' 'concentration' 'chrysophyte assemblage' 'dD' 'calcification rate' 'foraminifera' 'dust' 'chloride' 'sulfate' 'nitrate' 'thickness' 'TEX86' 'effective precipitation' 'diatom' 'multiproxy' 'humidification index' 'accumulation rate' 'sodium' 'growth rate']
['Wood' 'Coral' 'LakeSediment' 'MarineSediment' 'Documents' 'GlacierIce' 'Borehole' 'Sclerosponge' 'Speleothem' 'Other' 'GroundIce' 'MolluskShell' 'speleothem']
# plot a bar chart of the number of proxy types included in the dataset
fig = plt.figure(figsize=(12, 6), dpi=150)
ax = plt.gca()
plt.bar(range(len(ticks)), counts, color=colours)
plt.xlabel('proxy type')
plt.ylabel('count')
ax.set_xticks(range(len(ticks)), ticks, rotation=45, ha='right')
#ax.set_xticklabels(proxy_types, rotation=45, ha='right')
ax.set_yscale('log')
plt.title('removed proxy types')
fig.tight_layout()
plt.show()
utf.figsave(fig, 'SF_removed_recs_barchart_proxytypes', add='%s/'%df_dupfree.name)
saved figure in /figs/all_merged_LL_25-12-11_dupfree//SF_removed_recs_barchart_proxytypes.pdf
#%% plot the spatial distribution of the removed records
proxy_lats = df_dupfree_rmv['geo_meanLat'].values
proxy_lons = df_dupfree_rmv['geo_meanLon'].values
# plots the map
fig = plt.figure(figsize=(10, 5), dpi=200)
grid = GS(1, 3)
ax = plt.subplot(grid[:, -2:], projection=ccrs.Robinson()) # create axis with Robinson projection of globe
ax.stock_img()
ax.add_feature(cfeature.LAND) # adds land features
ax.coastlines() # adds coastline features
mt = 'ov^<>pP*XDd'*10 # generates string of marker types
archive_marker = {aa: mm for aa, mm in zip(archive_types, mt)} # attributes marker type to each archive type
archive_colour = {aa: cc for aa, cc in zip(archive_types, col)}
# loop through the data to generate a scatter plot of each data record:
# 1st loop: go through archive types individually (determines marker type)
# 2nd loop: through paleo proxy types attributed to the specific archive, which is colour coded
for jj, at in enumerate(archive_types):
    arch_mask = df_dupfree_rmv['archiveType']==at
    arch_proxy_types = np.unique(df_dupfree_rmv['paleoData_proxy'][arch_mask])
    for ii, pt in enumerate(arch_proxy_types):
        pt_mask = df_dupfree_rmv['paleoData_proxy']==pt
        at_mask = df_dupfree_rmv['archiveType']==at
        plt.scatter(proxy_lons[pt_mask&at_mask], proxy_lats[pt_mask&at_mask],
                    transform=ccrs.PlateCarree(), zorder=999,
                    marker=mt[ii], color=archive_colour[at],
                    label=at+': '+pt+' ($n=%d$)'% df_dupfree_rmv['paleoData_proxy'][(df_dupfree_rmv['paleoData_proxy']==pt)&(df_dupfree_rmv['archiveType']==at)].count(),
                    lw=.5, ec='k')
plt.title('removed proxy types')
plt.legend(bbox_to_anchor=(0.03,1.1), ncol=2, fontsize=9, framealpha=0)
grid.tight_layout(fig)
utf.figsave(fig, 'SF_removed_spatial', add='%s/'%df_dupfree.name)
/tmp/ipykernel_1267500/3804216409.py:36: UserWarning: Tight layout not applied. tight_layout cannot make axes width small enough to accommodate all axes decorations
  grid.tight_layout(fig)
saved figure in /figs/all_merged_LL_25-12-11_dupfree//SF_removed_spatial.pdf